Genomic sequence of Rhizobium sp. NGR234 symbiotic plasmid

ABSTRACT

The complete nucleotide sequence of the symbiotic plasmid pNGR234a, isolated from Rhizobium sp. NGR234, has been determined and analyzed, and is presented here. The analysis includes the identification of a number of novel ORFs, and of the proteins expressible therefrom, to which putative functions have been ascribed.

TECHNICAL FIELD

This invention relates to a symbiotic plasmid of the broad host-range Rhizobium sp. NGR234 and its use. In particular, this invention relates to the isolation and analysis of the complete sequence of the NGR234 symbiotic plasmid pNGR234a, and the open reading frames (ORFs) identifiable therein as well as the proteins expressible from said ORFs.

BACKGROUND OF THE INVENTION

Together with carbon, hydrogen and oxygen, nitrogen is one of the essential components in organic chemistry. Although it is present in vast quantities in the atmosphere, nitrogen in its diatomic form N₂ remains unassimilable by living organisms. The nitrogen cycle begins with the fixation of nitrogen into ammonia, which is chemically more reactive and can be assimilated into the food chain. A large fraction of the total nitrogen fixed every year is produced by microorganisms. Among these, the soil bacteria of the genera Azorhizobium, Bradyrhizobium, Sinorhizobium and Rhizobium, generally referred to as rhizobia, fix nitrogen in symbiotic associations with many plants from the Leguminosae family. This highly specific interaction leads to the formation of specialized root-, and in the case of Azorhizobium, stem-structures called nodules. It is within these nodules that rhizobia differentiate into bacteroids capable of fixing atmospheric nitrogen into ammonia. In turn, ammonia diffuses into the plant cells and sustains plant growth even under limiting nitrogen conditions.

The Rhizobium-legume interaction presents many interesting features. Obviously, the possibility of using this symbiosis as an “environmentally friendly” way to provide some of the most important world crops (such as soybean, bean and many other legumes) with fixed nitrogen without using nitrate-rich fertilizers has important economic consequences. It is also an ideal model to study a non-pathogenic interaction between bacteria and a highly developed, multicellular organism such as the host plant. Furthermore, the various steps involved in the establishment of a functional nitrogen symbiosis, which include some dramatic morphological changes as well as processes of cellular differentiation, require a complex exchange of molecular signals. Despite many decades of studies, it is only recently that the Rhizobium-legume interaction has been partially understood at the molecular level. The establishment of a functional symbiosis can be divided into two major steps as follows.

(A) Rhizosphere Ecology and Nodulation

Rhizobia are soil bacteria that proliferate in the rhizosphere of compatible plants, taking advantage of the many compounds released by plant roots. In return it has been shown that the presence of rhizobia in the rhizosphere reduces susceptibility of plants to many root diseases. In the case of low nitrogen levels in the soil, compatible rhizobia can interact with host plants and start the nodulation process (Long, 1989; Fellay et al., 1995; van Rhijn and Vanderleyden, 1995). Molecular signalling between the two partners begins with the release by the plant of phenolic compounds (mostly flavonoids) that induce the expression of nodulation genes (referred to as nod, nol and noe genes). The NodD1 gene product appears to be the central mediator between the plant signal and nodulation gene induction (Bender et al., 1988). It is modified by the binding of flavonoids and acts as a positive regulator on the expression of the remaining nodulation genes. Among them, the nodABC loci encode products responsible for the synthesis of the core structure of lipooligosaccharides called Nod factors (Relić et al., 1994). More nodulation genes are involved in strain-specific modifications of the Nod factors as well as in their secretion. It seems established now that variability in the structure of Nod factors may play a significant role in the determination of the host-range of a given Rhizobium strain, that is, in its ability to efficiently nodulate different legumes. For example, the strain Rhizobium meliloti can only nodulate Medicago, Melilotus and Trigonella spp., whereas Rhizobium sp. NGR234 can symbiotically interact with more than 105 different genera of plants, including the non-legume Parasponia andersonii.

The structure of many Nod factors, their isolation from Rhizobium strains and their commercial application in agriculture have been described (NodNGR factors: Relić et al., 1994; WO 94/00466; NodRm factors: WO 91/15496). Secreted Nod factors act in turn as signal molecules that allow rhizobia to enter young root hairs of a host plant, and induce root-cortical cell division that will produce the future nodule. Invaginated rhizobia progress towards the forming nodule within infection threads that are synthesized by the plant cells. Bacteria are then released into the cytoplasm of dividing nodule cells where they differentiate into bacteroids capable of fixing atmospheric nitrogen.

With respect to regulation of the nodulation genes, other regulatory genes with similarities to nodD1 (genes that belong to the lysR family) have been identified in various strains (Davis and Johnston, 1990). The function of these genes, called nodD2, nodD3 or syrM, is only partially understood. Some nodD genes have been described (WO 94/00466; CA 1314249; WO 87/07910; U.S. Pat. No. 5,023,180). Also, recombinant DNA molecules including the consensus sequence of the promoters of nodD1-regulated genes, called nod-boxes (Fisher and Long, 1993), have been disclosed (U.S. Pat. Nos. 5,484,718; 5,085,588). Finally, recombinant plasmids with the nodABC genes or, in one case (Bradyrhizobium japonicum), a sequence influencing host specificity have been disclosed (U.S. Pat. Nos. 5,045,461; 4,966,847).

(B) Symbiotic Nitrogen Fixation

Inside the nodules, rhizobia differentiate into bacteroids that express the enzymatic complex (nitrogenase) required for the reduction of atmospheric nitrogen into ammonia. The nitrogenase is encoded by three genes, nifH, nifD and nifK, which are well conserved in nitrogen fixing organisms (Badenoch-Jones et al., 1989). Many additional loci are necessary for functional nitrogenase activity. Those originally identified in Klebsiella pneumoniae are known as nif genes, whereas those found only in Rhizobium strains are described as fix genes (Fischer, 1994). Some of these gene products are required for the biosynthesis of cofactors or the assembly of the enzymatic complex, or play regulatory and various accessory roles (oxygen-limited respiration, etc.). Many of these genes are less conserved among the various rhizobial strains and in some cases their function is still not fully understood. The high sensitivity of the nitrogenase complex to free oxygen requires a very strict control of most nif and fix gene expression. In this respect, the FixL, FixJ, FixK, NifA and RpoN proteins have been identified in representative Rhizobium species as the major regulatory elements that, in microanaerobic conditions, activate the synthesis of the nitrogenase complex (Fischer, 1994). Recombinant DNA molecules containing nif genes/promoters have been disclosed: nifH promoters of B. japonicum (U.S. Pat. No. 5,008,194), nifH and nifD promoter of R. japonicum (EP 164245), nifA of B. japonicum and R. meliloti (EP 339830), nifHDK and hydrogen-uptake (hup) genes of R. japonicum (EP 205071).

Many more genetic determinants play a significant role in the Rhizobium-legume symbiosis. Genes (exo, lps and ndv genes) involved in the production of extracellular polysaccharides (EPS), lipopolysaccharides (LPS) and cyclic glucans of rhizobia play an essential role in the symbiotic interaction (Long et al., 1988; Stanfield et al., 1988). Mutation in these genes negatively influences the development of functional nodules. In this respect, some exopolysaccharides of the NGR234 derivative strain ANU280 have been disclosed (WO 87/06796). Although Nod factors seem to play a key role in the nodulation process, experimental data indicate that other signal molecules produced by the bacterial symbionts are required for functional symbiosis and may play a role in coordinating various steps such as the controlled invasion process, the release of rhizobia from the infection thread into the plant cell cytoplasm, the bacteroid differentiation process, etc. Moreover, the need for rhizobia to survive in the rhizosphere and to compete adequately with other microorganisms requires many more unidentified genes that, although they may not be characterised as proper symbiotic loci, do affect the efficiency of the various strains to induce functional nitrogen fixing symbiosis in field conditions. Finally, in our view genetic engineering of improved rhizobial strains cannot be pursued without a more extended knowledge of the structure and complexity of the Rhizobium symbiotic genome.

In this respect we decided to determine the complete DNA sequence of a symbiotic plasmid of Rhizobium sp. NGR234. In contrast to Bradyrhizobium and Azorhizobium that carry symbiotic genes on large chromosomes (ca. 8 Mbp) and to R. meliloti that harbours two very large symbiotic plasmids of 1.4 and 1.6 Mbp, NGR234 carries a single plasmid of ca. 500 kbp, pNGR234a. Moreover, it has been shown by transfer of pNGR234a into heterologous rhizobia, and even into non-nodulating Agrobacterium tumefaciens, that most nodulation functions are encoded by this plasmid (Broughton et al., 1984). The fact that NGR234 is able to interact symbiotically with more plants than any other known strain, and that a complete ordered cosmid library of pNGR234a was available, reinforced NGR234 as the best choice for a large-scale sequencing effort on a symbiotic plasmid (Perret et al., 1991; Freiberg et al., 1997).

Automated fluorescent methods have been used to sequence cosmids from eukaryotic organisms, including Saccharomyces cerevisiae (Levy, 1994), Caenorhabditis elegans (Sulston et al., 1992), Drosophila melanogaster (Hartl and Palazzolo, 1993), and Homo sapiens (Bodmer, 1994), as well as chromosomes from the prokaryotes Haemophilus influenzae (Fleischmann et al., 1995) and Mycoplasma genitalium (Fraser et al., 1995). In most large-scale sequencing centres this technology is based mainly on the shotgun approach. After random fragmentation of DNA (e.g. cosmids, bacterial artificial chromosomes (BACs), entire chromosomes) using sonication or mechanical forces, size-selected fragments are subcloned into M13 phages, phagemids or plasmids and sequenced by cycle sequencing using dye primers (Craxton, 1993). A disadvantage of this method is that DNA regions with elevated GC contents produce large numbers of compressions (unresolvable foci in sequence gels) in the dye primer sequences, leading to several hundred compressions per assembled cosmid sequence. It is known that the use of dye terminators (fluorescently labelled dideoxynucleoside triphosphates) instead of dye primers reduces the number of compressions (Rosenthal and Charnock-Jones, 1993). Therefore, dye terminators are frequently being used for gap closure and proofreading after assembly of the shotgun data.

To sequence GC-rich cosmids with the highest accuracy, the effectiveness of shotgun sequencing with dye terminators in comparison to dye primer sequencing was investigated. To improve the incorporation of dye terminators into DNA, a modified Taq DNA polymerase carrying a single mutation was used (Tabor and Richardson, 1995). This enzyme has properties similar to a thermostable “sequenase” and is commercially available as Thermo Sequenase (Amersham, Buckinghamshire, UK) or AmpliTaq FS (Perkin-Elmer, Foster City, Calif., USA). Concentrations of dye terminators needed in the cycle sequencing reactions can be reduced by 20-250 times. It was found that dye terminator shotgun sequencing leads to compression-free raw data that can be assembled much faster than shotgun data mainly obtained by dye primer sequencing. This strategy thus allows a several-fold increase in speed to sequence individual cosmids. This was demonstrated by comparing assembly of the sequence data of two cosmids from pNGR234a generated by different chemistries: cosmid pXB296 was sequenced with dye terminators, whereas data for pXB110 were obtained using the common dye primer method. Also disclosed is the analysis of the entire pXB296 sequence.

Moreover, the dye terminator shotgun sequencing strategy used to generate the sequence data for pXB296 was also used to sequence all the other remaining overlapping cosmids of the plasmid pNGR234a. In summary, 20 cosmids have been sequenced together with two PCR products and a subcloned DNA fragment derived from a cosmid identified as pXB564 in order to generate the plasmid's complete nucleotide sequence.

After its assembly, the analysis of the entire nucleotide sequence of pNGR234a, especially the determination of putative coding regions and the prediction of their expressible proteins and putative functions, was performed. Initially, analysis of the region covered by cosmid pXB296 was extended to cosmids pXB368 and pXB110. Thus, in approximately 100 kb of the plasmid (position 417,796-517,279) most ORFs and their deduced proteins with different putative functions were predicted. Subsequently, the rest of pNGR234a was analyzed.

SUMMARY OF THE INVENTION

The present invention provides the complete nucleotide sequence of symbiotic plasmid pNGR234a or degenerate variants thereof of Rhizobium sp. NGR234.

The present invention also contemplates sequence variants of the plasmid pNGR234a altered by mutation, deletion or insertion.

Also encompassed by the present invention are each of the ORFs derivable from the nucleotide sequence of pNGR234a or variants thereof.

In a preferred embodiment, the ORFs derived from the nucleotide sequence of pNGR234a encode the functions of nitrogen fixation, nodulation, transportation, permeation, synthesis and modification of surface poly- or oligosaccharides, lipo-oligosaccharides or secreted oligosaccharide derivatives, secretion (of proteins or other biomolecules), transcriptional regulation or DNA-binding, peptidolysis or proteolysis, transposition or integration, plasmid stability, plasmid replication or conjugal plasmid transfer, stress response (such as heat shock, cold shock or osmotic shock), chemotaxis, electron transfer, synthesis of isoprenoid compounds, synthesis of cell wall components, rhizopine metabolism, synthesis and utilization of amino acids, rhizopines, amino acid derivatives or other biomolecules, degradation of xenobiotic compounds, or encode proteins exhibiting similarities to proteins of amino acid metabolism or related ORFs, or enzymes (such as oxidoreductase, transferase, hydrolase, lyase, isomerase or ligase).

In another preferred embodiment, the ORFs are under the control of their natural regulatory elements or under the control of analogues to such natural regulatory elements.

The present invention also provides the sequences of the intergenic regions of pNGR234a which, in a preferred embodiment, are regulatory DNA sequences or repeated elements. In a further preferred embodiment, the intergenic sequences are ORF-fragments.

Also provided by the present invention are mobile elements (insertion elements or mosaic elements) derivable from the nucleotide sequences of the present invention.

The present invention also contemplates the use of the disclosed nucleotide sequences or ORFs in the analysis of genome structure, organisation or dynamics.

Also provided by the present invention is the use of the nucleotide sequences or ORFs in the subcloning of new nucleotide sequences. In a preferred embodiment, the new nucleotide sequences are coding sequences or non-coding sequences.

In yet a further preferred embodiment, the nucleotide sequences or ORFs are used in genome analysis and subcloning methods as oligonucleotide primers or hybridization probes.

The present invention further provides proteins expressible from the disclosed nucleotide sequences or ORFs.

Also contemplated by the present invention is the use of the disclosed nucleotide sequences, individual ORFs or groups of ORFs or the proteins expressible therefrom in the identification and classification of organisms and their genetic information, the identification and characterisation of nucleotide sequences, the identification and characterisation of amino acid sequences or proteins, the transportation of compounds to and from an organism which is host to said nucleotide sequences, ORFs or proteins, the degradation and/or metabolism of organic, inorganic, natural or xenobiotic substances in a host organism, or the modification of the host-range, nitrogen fixation abilities, fitness or competitiveness of organisms.

The present invention also provides plasmid pNGR234a of Rhizobium sp. NGR234 comprising the disclosed nucleotide sequence or any degenerate variant thereof.

The present invention also provides a plasmid harbouring at least one of the disclosed ORFs or any degenerate variant thereof.

The plasmids of the invention may be produced recombinantly and/or by mutation, deletion, insertion or inactivation of an ORF, ORFs or groups of ORFs.

The present invention also provides the use of the disclosed plasmids or variants thereof in obtaining a synthetic minimal set of ORFs required for functional Rhizobium-legume symbiosis, the modification of the host-range of rhizobia, the augmentation of the fitness or competitiveness of Rhizobium sp. NGR234 in the soil and its nodulation efficiency on host plants, the introduction of desired phenotypes into host plants using the disclosed plasmids as stable shuttle systems for foreign DNA encoding said desired phenotypes, or the direct transfer of the disclosed plasmids into rhizobia or other microorganisms without using other vectors for mobilization.

The nucleotide sequences of the present invention were advantageously obtained using known cycle sequencing methods. The preferred dye terminator/thermostable sequenase shotgun sequencing method used to generate the nucleotide sequences of the present invention, when applied to cosmids and when compared to other sequencing methods, was shown to yield sequence reads of the highest fidelity. Consequently, the speed of assembly of particular cosmids was increased, and the resultant high-quality sequences required little editing or proofreading. Thus, the preferred sequencing method described herein was successfully used to generate the complete nucleotide sequence of all the overlapping cosmids of plasmid pNGR234a, thereby resulting in the assembly of the complete sequence of the plasmid.

The complete sequence of pNGR234a is disclosed for the first time in this application, as are the majority of the ORFs predicted within the sequence. Putative functions have been ascribed to the novel and inventive ORFs disclosed herein and the proteins for which they code.

BRIEF DESCRIPTION OF DRAWINGS

The present invention is described below and illustrated thereafter in the appended examples, with reference to the following figures:

FIG. 1 A graph comparing sequences from pXB296 generated by different cycle sequencing methods.

FIG. 2 A schematic diagram showing the organization of the predicted ORFs in pXB296 from Rhizobium sp. NGR234.

FIG. 3 The complete nucleotide sequence of plasmid pNGR234a (with the pages labelled sequentially from 19961 to 1996142).

FIG. 4 A schematic diagram showing the map of the 20 sequenced cosmids covering the 536 kb symbiotic plasmid pNGR234a of Rhizobium sp. NGR234.

FIG. 5 A diagram showing multiple alignments of the nucleotide sequences of the replication origins of various plasmids.

FIG. 6 A diagram showing multiple DNA sequence alignments of the regions containing the origin of transfer of various plasmids.

FIG. 7 A schematic diagram showing a circular representation of the symbiotic plasmid pNGR234a of NGR234.

DETAILED DESCRIPTION OF THE INVENTION AND BEST MODE

Comparison of Different Shotgun Sequencing Strategies

The following is a more detailed description of certain key aspects of the present invention.

GC-rich cosmids were examined to investigate whether they could be sequenced much more efficiently using dye terminators throughout the shotgun phase instead of dye primers. As a test case, cosmid pXB296 with a GC content of 58 mol % from pNGR234a, the symbiotic plasmid of Rhizobium sp. NGR234, was exclusively sequenced using dye terminators in combination with a thermostable sequenase [Thermo Sequenase (Amersham)]. Another rhizobial cosmid with identical GC content, pXB110, was sequenced using traditional dye primer chemistry and Taq DNA polymerase.

Using the dye terminator/thermostable sequenase shotgun strategy, it was shown that most, if not all, compressions could be resolved and reads were produced with the highest fidelity among all sequencing chemistries tested. As a result, a much faster assembly of cosmid pXB296 in comparison to pXB110 was obtained. The shotgun data could be assembled into a high-quality sequence without extensive editing and proofreading. By measuring the error rate in overlapping regions between individual cosmids from pNGR234a, as well as in the cosmid vector sequence itself (data not shown), it was estimated that the accuracy of the pXB296 sequence is higher than 99.98%. Similar results would be expected with other thermostable sequenases such as AmpliTaq FS (Perkin-Elmer), because thermostable sequenases have similar properties.
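The accuracy estimate above rests on comparing independently sequenced overlapping regions. The following Python sketch illustrates that kind of calculation under simplifying assumptions: the two overlap sequences are taken to be already aligned and of equal length, every discrepancy is counted as one error, and the function name `overlap_accuracy_percent` is hypothetical. It illustrates the arithmetic only and is not the procedure actually applied to pNGR234a.

```python
def overlap_accuracy_percent(seq_a: str, seq_b: str) -> float:
    """Percentage of identical positions between two pre-aligned overlap sequences.

    Assumption: both strings cover the same region, base for base
    (no insertions or deletions), so positions can be compared directly.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("overlap sequences must have the same length")
    if not seq_a:
        raise ValueError("overlap sequences must not be empty")
    # Count positions where the two independent sequences disagree.
    mismatches = sum(1 for a, b in zip(seq_a.upper(), seq_b.upper()) if a != b)
    return 100.0 * (1 - mismatches / len(seq_a))


if __name__ == "__main__":
    # Hypothetical 10,000-base overlap with a single discrepancy.
    reference = "ACGT" * 2500
    other = "T" + reference[1:]
    print(f"{overlap_accuracy_percent(reference, other):.2f}%")
```

On this definition, an accuracy above 99.98% corresponds to fewer than two discrepancies per 10,000 compared bases.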

Dye primer chemistry in combination with Thermo Sequenase was also examined. Although the peak uniformity of signals was much improved over dye primer/Taq DNA polymerase data, the number of compressions in GC-rich shotgun reads was not reduced significantly. Compressions in shotgun raw data enormously increase the overall effort of editing, proofreading, and finishing a cosmid as shown for pXB110 (Table 1).

Because of their longer reading potential, dye primer reads are helpful for gap closure. However, using ABI 373A sequencers (Applied Biosystems, Inc. (ABI), Perkin-Elmer, Foster City, Calif., USA), dye primer reads are, on average, only ~50 bases longer than dye terminator reads.

Using the experimental conditions of the present invention, shotgun sequencing with dye terminators and a thermostable sequenase is superior because for GC-rich cosmid templates it removes most of the compressions and this leads to a several-fold improvement in assembling and finishing of cosmid-sized projects. Although dye terminators are slightly more expensive than dye primers, the overall saving in time for finishing projects has, in our experience, a much greater effect on general costs.

It has been shown that the strategy of the present invention is effective for high-throughput shotgun sequencing of GC-rich templates. This strategy was therefore used to sequence the remaining 19 overlapping cosmids of the symbiotic plasmid pNGR234a of Rhizobium sp. NGR234. In total, 20 cosmids, two PCR products (1.5 and 2.0 kb in length) and a 1.5 kb restriction fragment were sequenced in order to generate the complete pNGR234a sequence (FIG. 4).

TABLE 1 Comparison of the assembly of the sequence data from cosmids pXB296 (dye terminator shotgun reads) and pXB110 (dye primer shotgun reads)

Data assembly | pXB296 | pXB110
Average length of the shotgun reads (bases) | 332 | 378
No. of shotgun reads used for assembly | 786 | 899
No. of shotgun reads assembled with 4% mismatch^(a) | 736 | 308
No. of shotgun reads assembled with 25% mismatch^(a) | 775 | 879
No. of contigs^(b) longer than 1 kbp | 3 | 25
No. of contigs left after editing^(c) | 2 | 4
No. of additional reads (gap closure and proofreading)^(d) | 32 | 191
Total length of cosmid insert (bp) | 34,010 | 34,573
Sequencing redundancy (per bp) | 8.0 | 10.5

^(a)Assembling program: XGAP; principal autoassembling conditions: normal shotgun assembly, joins permitted, minimum initial match = 15, maximum no. of pads per reading during the alignment procedure = 8, maximum no. of pads per reading in contig to align any new reading = 8, alignment mismatches 4% and 25%, respectively.
^(b)Contiguous parts of sequence created by overlapping reads.
^(c)Lengths of contigs: 6-10 kbp (pXB296); 2-12 kbp (pXB110).
^(d)Reads necessary for closing gaps and making single-stranded regions double-stranded by primer walking on selected templates and, in case of pXB110, for solving ambiguities (compressions) by the resequencing of clones with universal primer and dye terminators.

Genetic Organization of pXB296

All 28 predicted open reading frames (ORFs) in pXB296 (FIG. 2) show significant homologies to database entries (Table 2). The first putative gene cluster (cluster I) containing ORF1 to ORF5 corresponds to various oligopeptide permease operons (Hiles et al., 1987; Perego et al., 1990). Only ORF5 shows homology to a gene from a different bacterium, Bacillus anthracis (Makino et al., 1989). Each homologue encodes membrane-bound or membrane-associated proteins suggesting that all five ORFs are involved in oligopeptide permeation.
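ORF prediction of the kind summarized in Table 2 can be illustrated with a minimal Python sketch. It assumes the start codons ATG, GTG and TTG considered in Table 2 and the three standard stop codons, scans both strands in all six frames, and reports 1-based plus-strand coordinates running from the first base of the start codon to the last base of the stop codon. The function names and the `min_codons` threshold are hypothetical; the sketch does not reproduce the software or the criteria actually used to annotate pXB296.

```python
START_CODONS = {"ATG", "GTG", "TTG"}   # start codons considered in Table 2
STOP_CODONS = {"TAA", "TAG", "TGA"}    # standard stop codons
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def _scan_strand(seq: str, min_codons: int):
    """Return 0-based half-open (start, end) ORF intervals on one strand."""
    hits = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in START_CODONS:
                start = i                      # first start codon after the last stop
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    hits.append((start, i + 3))  # include the stop codon
                start = None
    return hits


def find_orfs(seq: str, min_codons: int = 2):
    """Return sorted (first_base, last_base, strand) tuples, 1-based on the plus strand."""
    seq = seq.upper()
    n = len(seq)
    orfs = [(s + 1, e, "+") for s, e in _scan_strand(seq, min_codons)]
    # Map intervals found on the reverse complement back to plus-strand coordinates.
    orfs += [(n - e + 1, n - s, "-")
             for s, e in _scan_strand(reverse_complement(seq), min_codons)]
    return sorted(orfs)


if __name__ == "__main__":
    print(find_orfs("CCATGAAACCCTAGGG"))
```

A real annotation would additionally weigh ribosomal binding sites, codon usage and database homology, as the columns of Table 2 indicate.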

Organization of the predicted gene cluster IV, including the nifA homologue ORF16 (fixABCX, nifA, nifB, fdxN, ORF, fixU homologues, position 16,746-24,731), the predicted locations of the σ⁵⁴-dependent promoters and the nifA upstream activator sequences (FIG. 2), correspond to the organization found in Rhizobium meliloti and Rhizobium leguminosarum bv. trifolii (Iismaa et al., 1989; Fischer, 1994). NifA is a positive transcriptional activator (Buikema et al., 1985), whereas nif and fix genes are essential for symbiotic nitrogen fixation. Identification of σ⁵⁴-dependent promoter sequences, together with the activator motifs upstream of ORF21, ORF22, and ORF23, suggests that these ORFs may play an important, but still undefined, role in symbiosis.

Inevitably, large-scale sequencing uncovers differences with already published sequences. van Slooten et al. (1992) cloned a 5.8 kb EcoRI fragment from Rhizobium sp. NGR234 and sequenced 2067 bp by manual radioactive methods (EMBL accession no. S38912). This sequence exhibits 2.4% mismatches with the corresponding sequence in pXB296.

TABLE 2 Putative ORFs of pXB296 and homologies of the deduced amino acid sequences to known proteins. The ribosomal binding site column gives the SD sequence, the distance from the start codon (bases) and the start codon^(d); SD sequence: 5′-TAAGGAGGTGA-3′.

ORF^(a) | st.^(b) | Position on cosmid (base no.)^(c) | Ribosomal binding site | No. of deduced amino acids | Homologous amino acids (position) | Homologous protein: name | Length (aa)^(e) | Function^(f) | Accession no. | Identity (%)^(g) | Similarity (%)^(g)
ORF1^(h) | + | 00001-00625 | | >207 | 1-207 | OppB | 306 | oligopeptide permease proteins | X05491 | 45 | 68
ORF2 | + | 00628-01503 | GTATCCGGT-7-ATG | 291 | 2-289 | OppC | 305 | oligopeptide permease proteins | X56347 | 37 | 63
ORF3 | + | 01505-02512 | AGCGGAGG-7-ATG | 335 | 8-327 | OppD | 336 | oligopeptide permease proteins | X56347 | 49 | 69
ORF4 | + | 02509-03570 | TGAAGTGGT-6-ATG | 353 | 2-323 | OppF | 334 | oligopeptide permease proteins | X05491 | 51 | 69
ORF5 | + | 03606-04991 | CAAGGA-6-ATG | 461 | 1-458 | CapA | 411 | encapsulation protein | M24150 | 25 | 48
ORF6 | + | 05460-06863 | CCGAGAGG-8-ATG | 467 | 1-464 | BioA | 455 | aminotransferase | M29292 | 29 | 55
ORF7 | + | 06888-08426 | GCCTTCGG-5-GTG | 512 | 97-509 | ORF^(i) | 417 | unknown | D37877 | 36 | 58
 | | | | | 34-510 | GapD | 482 | succinic semialdehyde dehydrogenase | M38417 | 33 | 57
ORF8 | − | 09781-10860 | GAACGTGG-8-ATG | 359 | 72-299 | ORF^(i) | 414 | transposase homologue minicircle DNA | X15942 | 30 | 48
ORF9 | + | 11124-12455 | ?-7-ATG | 443 | 2-443 | GLUD1 | 558 | glutamate dehydrogenase | M37154 | 41 | 60
ORF10 | − | 13370-14116 | AAAGGA-6-ATG | 248 | 1-245 | ORF2^(i) | 231 | transposase homologues, IS1162 | X79443 | 45 | 64
ORF11 | − | 14128-15672 | CATGGAG-7-TTG | 514 | 1-513 | ORF1^(i) | 558 | transposase homologues, IS1162 | X79443 | 41 | 62
ORF12 | − | 16712-16942 | GAAGGA-8-ATG | 76 | 1-70 | FixU | 70 | unknown | P42710 | 63 | 80
ORF13 | − | 16939-17265 | ACAAGAGG-7-ATG | 109 | 1-79 | ORF2^(i) | >78 | unknown | X07567 | 53 | 81
 | | | | | 15-107 | NifZ | 159 | involved in FeMo-cofactor synthesis | M20568 | 39 | 56
ORF14 | − | 17349-17543 | CCAGGAG-9-ATG | 64 | 1-64 | FdxN | 64 | ferredoxin-like | M21841 | 80 | 87
ORF15 | − | 17585-19066 | AGTGGAG-7-ATG | 493 | 1-493 | NifB | 490 | involved in FeMo-cofactor synthesis | M15544 | 73 | 84
ORF16 | − | 19292-20962 | ATTGG-12-ATG | 556 | 9-556 | NifA | 541 | transcriptional regulator | X02615 | 59 | 72
ORF17 | − | 21129-21422 | AGGGGAG-7-ATG | 97 | 1-97 | FixX | 98 | required for nitrogen fixation | M15546 | 84 | 87
ORF18 | − | 21437-22744 | AACTGAGGT-7-ATG | 435 | 1-435 | FixC | 435 | required for nitrogen fixation | M15546 | 83 | 90
ORF19 | − | 22755-23864 | ATAGGAG-6-ATG | 369 | 18-369 | FixB | 353 | required for nitrogen fixation | M15546 | 79 | 89
ORF20 | − | 23874-24731 | TAAAGAG-5-ATG | 285 | 1-285 | FixA | 292 | required for nitrogen fixation | M15546 | 74 | 85
ORF21 | − | 25148-25468 | CCAGGAG-10-ATG | 106 | 1-106 | ORF118^(i) | 108 | unknown | X13691 | 55 | 71
ORF22 | − | 26145-26711 | GAAGGAG-9-ATG | 188 | 9-199 | — | 241 | hypothetical protein | U32739 | 47 | 64
 | | | | | 1-173 | — | 166 | peroxisomal protein | U11244 | 32 | 57
ORF23 | + | 27169-27861 | GAAGGA-7-ATG | 230 | 1-167 | NifQ | 167 | probably involved in Mo-processing | X13303 | 37 | 57
ORF24 | + | 27920-29434 | CTGGGAGG-18-ATG | 504 | 1-454 | DctA1 | 456 | C₄-dicarboxylate transporter | S38912 | 97 | 98
 | | | | | 8-454 | DctA2 | 449 | C₄-dicarboxylate transporter | S38912 | 97 | 98
ORF25 | + | 29431-30675 | TTCGGCGG-12-ATG | 414 | 2-414 | CamC | 415 | cytP450-like | M12546 | 34 | 53
ORF26 | + | 30676-31332 | TTGGG-5-TTG | 218 | 30-190 | LinA | 155 | γ-hexachlorocyclohexane dechlorinase | D90355 | 27 | 51
ORF27 | + | 31329-33035 | AGTGGAG-10-ATG | 568 | 28-270 | FabG | 244 | reductase | M84991 | 38 | 57
 | | | | | 294-534 | | | | | 30 | 57
ORF28^(k) | + | 33173-34010 | CAAGGAG-5-ATG | >279 | 1-279 | LuxA | 355 | luciferase α-subunit | M10961 | 23 | 49

^(a)(ORF) Open reading frame.
^(b)(st.) Plus or minus strand.
^(c)Position on cosmid: from the first base of the start codon to the last base of the stop codon; alternative start points are 6912/6927/7017 (ORF7), 10665/10656 (ORF8), 11220 (ORF9), 15699/15651 (ORF11), 17322/17271 (ORF13), 20995/21076 (ORF16), 26744 (ORF22), 27229/27304 (ORF23), 27941 (ORF24), and 30751/30754 (ORF26).
^(d)(SD sequence) Shine-Dalgarno sequence (Shine and Dalgarno 1974). Bases underlined are identical with the Shine-Dalgarno sequence. The following possible start codons were considered: ATG, GTG, or TTG.
^(e)(aa) Amino acids.
^(f)Organisms: Salmonella typhimurium, Bacillus subtilis (OppBCDF), Bacillus anthracis (CapA), Bacillus sphaericus (BioA), Streptomyces hygroscopicus (ORF7 homolog), Escherichia coli (GapD), Streptomyces coelicolor (ORF8 homolog), Homo sapiens (GLUD1), Pseudomonas fluorescens (ORF10, ORF11 homologs), Rhizobium leguminosarum (FixU), Rhodobacter capsulatus (ORF13 homolog), Azotobacter vinelandii (NifZ), Rhizobium meliloti (FdxN, NifBA, FixXCBA), Bradyrhizobium japonicum (ORF118), Haemophilus influenzae (hypothetical protein), Lipomyces kononenkoae (peroxisomal protein), Klebsiella pneumoniae (NifQ), Rhizobium sp. NGR234 (DctA), Pseudomonas putida (CamC), Pseudomonas paucimobilis (LinA), Escherichia coli (FabG), Vibrio harveyi (LuxA).
^(g)Identity and similarity were calculated using the program BESTFIT (Smith and Waterman 1981).
^(h)(ORF1) 3′ end.
^(i)Translated ORF.
^(k)(ORF28) 5′ end.

It contains the gene dctA (encoding a C₄-dicarboxylate permease), which is 144 bases shorter than in pXB296. In this respect, a single nucleotide deletion in position 29,248 of the cosmid sequence close to the 3′ end of the gene causes a frameshift leading to a DctA product extended by 48 residues. van Slooten et al. (1992) also failed to identify the nifQ homologue, ORF23 (position 27,169-27,861), presumably because they overlooked a small XhoI fragment located between positions 27,349 and 27,536 on pXB296. Expression studies allowed these investigators to define a putative σ⁵⁴-dependent promoter in a 1.7 kb SmaI fragment (position 27,094-28,818 in pXB296). This fragment stretches from the upstream region of ORF23 to the 5′ part of dctA. The 58 bp intergenic region between ORF23 and dctA contains a stem-loop structure but no obvious promoter sequence. Possibly the promoter that controls dctA is located upstream of ORF23 (e.g. the minimal consensus sequence included in GGGGGCACAATTGC at position 27,098-27,111). Although clones containing dctA complemented mutants of R. meliloti and R. leguminosarum for growth on dicarboxylates, the growth of the NGR234 dctA deletion mutant was not affected (van Slooten et al., 1992). Nevertheless, this mutant was unable to fix nitrogen in nodules. Because dctA is now possibly part of a larger transcription unit, the symbiotic phenotype may also result from the inactivation of downstream genes.
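The effect described above, in which a single-base difference moves a stop codon out of frame and lengthens the predicted product, can be illustrated with a short Python sketch. The toy sequences and the function names `delete_base` and `codons_before_stop` are hypothetical and chosen only to show the mechanism; they are not taken from dctA. The only figures drawn from the text are the 144 bases and 48 residues, which agree because each residue is encoded by three bases.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def delete_base(seq: str, index: int) -> str:
    """Return seq with the base at 0-based position `index` removed."""
    return seq[:index] + seq[index + 1:]


def codons_before_stop(seq: str) -> int:
    """Count codons read in frame 0 before the first stop codon.

    If no in-frame stop is found, the number of complete codons is returned.
    """
    for i in range(0, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return i // 3
    return len(seq) // 3


if __name__ == "__main__":
    # Toy sequence: ATG AAA CCA TAA ... stops after three codons.
    longer_read = "ATGAAACCATAAGGGTTAG"
    # Removing one base shifts the frame: ATG AAA CCT AAG GGT TAG.
    shifted = delete_base(longer_read, 8)
    print(codons_before_stop(longer_read), codons_before_stop(shifted))
    # 144 additional bases correspond to 144 / 3 = 48 additional residues.
    print(144 // 3)
```

In the toy case the early TAA is no longer read in frame after the deletion, so translation continues to the next in-frame stop, which is the same mechanism invoked for the extended DctA product.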

Interestingly, the GC content of the predicted pXB296 ORFs ranges from 53.3 mol % to 64.6 mol %, with an overall cosmid GC content of 58.5 mol %. Genomes of Azorhizobium, Bradyrhizobium, and Rhizobium species have GC contents of 59 mol % to 65 mol % (Padmanabhan et al., 1990), with 62 mol % reported for Rhizobium sp. NGR234 (Broughton et al., 1972). Although pXB296 covers <7% of the complete symbiotic plasmid sequence, its lower overall GC value suggests that symbiotic genes might have evolved by lateral transfer from other organisms. In this case, methods of the type applied in the present invention will become even more relevant in sequencing the whole genome.
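A GC content figure of this kind can be obtained by counting G and C bases as a percentage of all unambiguous bases. The sketch below is one simple way to do this and is not necessarily the procedure used to obtain the values above; the function name is hypothetical, and the handling of ambiguous bases (ignoring them) is an assumption.

```python
def gc_content(seq):
    """Return the GC content of a nucleotide sequence in mol %.

    Only unambiguous bases (A, C, G, T) are counted; anything else,
    such as N, is ignored. This choice is an assumption of the sketch.
    """
    counted = [base for base in seq.upper() if base in "ACGT"]
    if not counted:
        raise ValueError("sequence contains no unambiguous bases")
    gc = sum(1 for base in counted if base in "GC")
    return 100.0 * gc / len(counted)


# Values quoted in the text.
cosmid_gc = 58.5                 # overall GC content of pXB296, mol %
genome_gc_range = (59.0, 65.0)   # range reported for rhizobial genomes, mol %

# The cosmid value lies below the lower bound of the genomic range.
cosmid_below_range = cosmid_gc < genome_gc_range[0]

print(round(gc_content("ATGCGC"), 1), cosmid_below_range)
```

Applied per ORF and to the cosmid as a whole, such a function would give the per-ORF range and the overall value that the paragraph compares with the genomic figures.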

Genetic Organization of the 100 kb Region Covered by Cosmids pXB296, pXB368 and pXB110

Extending the analysis of pXB296 to a 100 kb region stretching from position 417,796 to 517,279 on the symbiotic plasmid pNGR234a initially led to the assignment of only 76 ORFs, listed in Table 3 (excluding the first incomplete ORF noted in the analysis of pXB296, “ORF1” of Table 2). The ORFs y4tQ to y4vJ (excluding ORFs y4uD and y4uG and excluding ORF-fragments fu1, fu2, fu3, fu4 and fv1; see Table 3) are identical to ORFs 2 to 28 of the analysis of pXB296 in Table 2 apart from minor revisions (N.B. the analysis recited in Table 3 should be taken as the definitive analysis; Table 2 merely represents preliminary findings). The cosmid pXB110 was sequenced with the dye primer shotgun sequencing strategy in order to compare it with the dye terminator shotgun sequencing strategy used to sequence cosmid pXB296. In combination with pXB296 and pXB368, it covers nearly this entire region. A PCR product and a restriction fragment of cosmid pXB564 also had to be sequenced in order to fill in the gap from position 480,607 to 483,991 between cosmids pXB368 and pXB110 (FIG. 4). Among the 76 predicted ORFs, 7 ORFs and their deduced proteins show no homologies to database entries. The other predicted ORFs and their deduced proteins do exhibit such homologies and therefore play putative roles in nitrogen fixation (ORFs y4uJ to y4vB, y4vE, y4vN to y4vR, y4wK and y4wL), nodulation (ORFs y4yC and y4yH), transportation (ORFs y4tQ to y4uA, y4vF and y4wM), secretion of proteins or other biomolecules (ORFs y4yI and y4yO), transcriptional regulation/DNA binding (ORFs y4wC and y4xI), amino acid metabolism or metabolism of amino acid derivatives (ORFs y4uB, y4uC, y4uF, y4wD, y4wE and y4xN to y4yA), degradation of xenobiotic compounds (ORFs y4vG to y4vI), peptidolysis/proteolysis (ORFs y4wA and y4wB) or transposition (ORFs y4uE, y4uH and y4uI) (see Table 3). The role of some ORFs in rhizobia, such as the luciferase-like ORFs (y4vJ and y4wF; see Table 3), is still not clear. In the 100 kb region, the duplication of a 5 kb sequence (position 451,886 to 456,157 and 483,764 to 488,035) including the genes nifHDK is remarkable. These genes encode the basic subunits of the nitrogenase. Furthermore, the transcriptional regulator nodD2 is very interesting because its role does not seem to be identical to that of a previously identified nodD2 in a closely related strain (Appelbaum et al., 1988; data not shown). The pmrA-homologous ORF y4xI also putatively plays an important role in regulating symbiotic processes, because a nod box (binding region for the basic regulator nodD1; Fisher and Long, 1993) is located upstream of this ORF (position 493,962 to 494,000). Finally, the presence of ORFs (y4yI and y4yK to y4yN; see Table 3) homologous to type III secretion proteins, which were previously known only in plant or animal/human pathogenic bacteria, suggests that only a subtle difference separates the symbiotic and pathogenic abilities of microorganisms.
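The coordinate ranges quoted above can be turned into lengths with simple interval arithmetic. The sketch below assumes that positions are 1-based and inclusive at both ends, which is an assumption about the numbering convention; the function name is hypothetical. Under that assumption the two nifHDK-containing copies are each 4,272 bp long, which the text rounds to 5 kb.

```python
def span_length(start, end):
    """Length in bp of a region given 1-based, inclusive coordinates."""
    if end < start:
        raise ValueError("end must not be smaller than start")
    return end - start + 1


# Coordinates on pNGR234a as quoted in the text.
region_bp = span_length(417796, 517279)   # the "100 kb" region
gap_bp = span_length(480607, 483991)      # gap between pXB368 and pXB110
copy1_bp = span_length(451886, 456157)    # first copy of the duplication
copy2_bp = span_length(483764, 488035)    # second copy of the duplication
nod_box_bp = span_length(493962, 494000)  # nod box upstream of y4xI

# Distance between the start positions of the two copies.
copy_offset_bp = 483764 - 451886

print(region_bp, gap_bp, copy1_bp, copy2_bp, nod_box_bp, copy_offset_bp)
```

Both copies having the same computed length is consistent with the description of the sequence as a duplication, and the region length of 99,484 bp matches the stated "100 kb".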

TABLE 3 List of the predicted functional ORFs and of fragments representing putative remnants of functional ORFs

Columns: ORF^(a); name; st.^(b); position in plasmid (base no.)^(c); no. of deduced amino acids; hom. amino acids (position); hom. protein name; hom. protein length (aa)^(d); accession no.^(e); I/%^(f); S/%^(f); functional note^(g)

y4aA −2/3 534696-000474 647 16-646 Shc 658 X86552 78 88 prob. squalene-hopene-cyclase; put. operon y4aABCD: inv. in synthesis of an isoprenoid compound y4aB −3 000523-001776 417 6-415 ORF1 414 X80766 43 63 put. flavoprotein oxidoreductase y4aC −2 001776-002615 279 3-247 Psy1 419 X68017 34 50 put. phytoene synthase y4aD −1 002612-003490 292 10-195 Crt1 342 L37405 33 51 hyp. protein hom. to squalene and phytoene synthetases fa1 −3 003487-004011 fragmentous character y4aF nolK −3 005173-006117 314 9-310 ORF14.8 321 U46859 51 70 put. NAD-dep. nucleotide sugar epimerase/dehydrogenase; NoeJKL/NodZ/NolK inv. in biosynthesis of fucose moiety of Nod factors y4aG noeH −2 006126-007181 351 4-339 RfbD 348 U24571 65 80 put. GDP-D-mannose dehydratase y4aH nodZ −1 007426-008394 322 3-254 NodZ 324 L22756 69 83 put. fucosyltransferase y4aI noeK −3 008623-010047 474 5-471 ORF5 483 U47057 42 59 put. phosphomannomutase y4aJ noeJ +3 010110-011648 512 33-498 XanB 466 M83231 50 65 put. mannose-1-phosphate guanylyltransferase y4aK +2 012125-012277 50 hyp. 5.5 kd protein y4aL nodD1 +2 012380-013348 322 1-322 NodD1 322 Y00059 98 99 / 1-310 NodD2 312 this work 68 84 transcriptional regulator (LysR family); high similarity to Y4xH(NodD2) y4aM +3 013911-014342 143 7-132 ORF3 127 L13845 50 66 / 1-143 Y4wC 143 this work 69 77 put. DNA-binding protein; high similarity to Y4wC y4aN +1 014488-014934 148 1-129 ORF3 128 X04833 41 56 homologue located nearby the replicator region of pRiA4b y4aO +3 015065-015643 192 hyp. 21.8 kd protein; low similarity to Y4nF (<30% id.) y4aP mucR +3 016161-016592 143 1-143 MucR 143 L37353 89 95 put. transcriptional regulator (Ros/MucR family); similarity to Y4pD; possibly inv. 
inregulation of exopolysaccharide synthesis y4aQ −2 017016-017582 18815-167 No1265 266 X74068 33 50 hyp. 20.4 kd protein; similar to Y4hP,Y4jD, Y4qI y4aR +2 017798-018121 107 hyp. 12.1 kd protein y4aS +1018121-018666 181 hyp. 20 kd protein fa2 +3 018912-019664 250 126-250Tnp 465 U04047 38 51 hyp. protein fragment 78-150 Y4iG 90 this work 9397 3-266 Y4bF 457 this work 53 73 y4bA −2 019674- 021758 694 1-393 fo6430 this work 89 95 hyp. 78.7 kd protein; identical to Y4pH 406-532 fo5136 this work 83 94 532-694 fo4 143 this work 77 83 y4bB −3021748-022014 88 2-88 Y4oL 88 this work 63 69 hyp. 9.7 kd proteinprecurser; identical to Y4pI y4bC −1 022034-022483 149 1-149 Y4oM 149this work 79 88 hyp. 16.8 kd protein; identical to Y4pJ y4bD −2022674-022943 89 20-89 Y4oN 70 this work 73 84 hyp. 10.2 kd protein;identical to Y4PK fb1 +2 022985-023659 224 36-224 Y4bF 457 this work 4263 hyp. protein fragment y4bF +1 023953-025326 457 130-436 Tnp 465U04047 31 46 put. transposase; 2-265 Fa2 266 this work 53 73 upstream ofthis ORF (23875- 77-169 Y4iG 90 this work 51 72 23987) 89% nt-id. topart 285-457 Fb1 188 this work 42 63 of origin of replication-region410-457 Y4JM 70 this work 75 79 (R. meliloti;, S66221) y4bG +1025870-026685 271 hyp. 30 kd protein precurser y4bH +1 028513-028788 91hyp. 9.6 kd integral membrane protein y4bI +3 028860-029276 138 3-108HI1631 190 U00085 41 61 hyp. 15.3 kd protein precurser y4bJ +1029392-031284 630 429-564 HtrA 503 L20127 40 53 hyp. 67.9 kd integralmembrane protein, distantly related to peptidase family S2C y4bK +2031625-032293 222 83-212 ORF1 215 D84146 25 45 hyp. 24.3 kd protein y4bL+1 032641-034191 516 7-515 ORF1 558 X79443 44 63 identical to Y4kJ andY4tB; similar to Fo3 and Fo7; put. transposase 6-516 Y4ul 515 thiswork48 66 y4bM +3 034188-034979 263 1-203 ORF2 231 X79443 45 62 identical toY4kI and Y4tA; put. 
6-248 Y4pL 245 this work 55 73 insertion sequenceATP- binding 6-254 Y4uH 248 this work 48 68 protein; similarity to Y4pL,Y4uH, 1-263 Y4iQ 298 this work 31 56 also to Y4sD/Y4nD/Y4iQ y4bN +1035278-036573 431 hyp. 47.6 kd protein y4bO +1 036646-038466 606 hyp.66.8 kd protein y4cA −1 038576-042169 1197 hyp. 137.7 kd protein;largest protein in pNGR234a y4cB −3 042226-042522 98 hyp. 10.2 kdintegral membrane protein y4cC −3 042556-044109 517 hyp. 57.8 kd proteiny4cD −2 044106-046028 640 hyp. 71.6 kd protein y4cE −3 046486-047661 391hyp. 43.4 kd protein y4cF −1 047687-048829 380 hyp. 41.8 kd protein y4cG+2 049361-050278 305 16-173 Pin 184 K00676 50 68 prob. DNA invertase“resolvase-type” 17-222 Y4IS 183 this work 40 60 Y4cH −2 050427-05063669 4-65 CspS 70 L23115 56 70 prob. cold shock regulator y4cI −2053202-054416 404 1-397 RepC 405 X04833 60 73 put. replication protein Cy4cJ −3 054571-055551 326 1-317 RepB 319 X89447 39 55 put. replicationprotein B y4cK −2 055608-056831 407 10-404 RepA 398 X89447 58 73 put.replication protein A y4cL tra1 +2 057635-058261 208 1-206 TraI 212U43675 55 66 prob. autoinducer synthetase (inv. in control of conjugaltransfer) y4cM trbB +3 058272-059249 325 3-325 TrbB 323 U43675 80 88prob. conjugal transfer protein 1-115 Y4oG 125 this work 25 51 (PulEfamily) y4cN trbC +1 059239-059622 127 7-127 TrbC 134 U43675 69 78 prob.conjugal transfer protein (integral membrane prot.) y4cO trbD +2059615-059914 99 1-99 TrbD 99 U43675 70 89 prob. conjugaltransferprotein (integral membrane prot.) y4cP trbEa +3 059925-060374149 1-136 TrbE 820 U43675 80 91 prob. conjugal transfer protein (hom. to5′ part of trbE) y4cQ trbEb +1 060394-062382 662 5-659 TrbE 820 U4367583 90 prob. conjugal transfer protein (hom. to 3′ part of trbE) y4dAtrbJ +2 062354-063157 267 1-107 TrbJ 175 U43675 60 69 prob. conjugaltransfer protein 194-267 71 79 y4dB trbK +1 063154-063351 65 5-65 TrbK75 U43675 40 56 prob. 
conjugal transfer protein precurser y4dC trbL +3063345-064520 391 3-387 TrbL 395 U43675 74 85 prob. conjugal transferprotein (integral membrane prot.) y4dD trbF +2 064544-065206 220 1-220TrbF 220 U43675 80 90 prob. conjugal transfer protein y4dE trbG +1065224-066036 270 6-270 TrbG 284 U43675 74 84 prob. conjugal transferprotein precurser y4dF trbH +1 066040-066486 148 1-147 TrbH 159 U4367555 68 prob. conjugal transfer protein precurser (with lipid anchor) y4dGtrbI +3 066498-067793 431 1.430 TrbI 433 U43675 66 79 prob. conjugaltransfer protein (integral membrane prot.) y4dH traR +2 068096-68806 2367-236 TraR 234 Z15003 28 45 prob. transriptional activator of conju- galtransfer genes (LuxR family) y4dI traM −1 068810-069133 107 8-101 TraM102 U43674 30 51 prob. modulator of TraR/autoinducer- mediatedactivation of tra genes y4dJ +3 069351-069584 77 1-67 ORF 84 X16458 3759 hyp. transcriptional regulator (PbsX family); low similarity toN-terminus of Y4dL y4dK −1 069629-069949 106 hyp. 11.8 kd protein fd1 −2069936-070250 (105) (2-85) ORFA 400 X67861 39 58 put. transposasefragment y4dL +1 070603-071193 196 — — — — — — hyp. 21.8 kd protein; lowsimilarity to Y4dJ y4dM +2 071186-072415 409 1-357 HipA 440 M61242 31 46hyp. 45.3 kd protein; homolog affects frequency 3-405 Y4mE 420 this work34 56 of persistence after inhibition of cell wall or DNA synthesis y4dN+1 072787-072975 62 — — — — — — hyp. 7 kd protein y4dO −1 073550-073951133 12-121 ORF 38.1 D83536 43 57 hyp. 14.9 kd (fragmentous?) protein;homology to intron protein of P. anserina continues in fr.-2(73541-73467) y4dP −1 074423-075025 200 1-48 ORFR2 57 U43674 72 89 hyp.21 kd protein; hom. to 56-198 ORFR3 154 47 71 conjugal transfer region 1y4dQ traB −2 075042-076205 387 1−387 TraB 421 U40389 61 72 prob.conjugal transfer protein y4dR traF −3 076195-076761 188 20-188 TraF 176U40389 55 73 prob. conjugal transfer protein y4dS traA −2 076758-0800661102 1-1102 TraA 1100 U43674 67 79 prob. 
conjugal transfer protein(relaxase) y4dT traC +3 080319-080627 102 1-102 TraC 98 U40389 64 80prob. conjugal transfer protein y4dU traD +1 080632-080847 71 1-71 TraD71 U43674 77 84 prob. conjugal transfer protein y4dV traG +2080834-082756 640 1-631 TraG 658 U40389 71 83 prob. conjugal transferprotein fd2 + 083002-083293 ORFL1 152 U43674 fragments hom. to ORFL 1(conjugal transfer region 1); frameshifts: 83072 (1 > 3), 83161 (3 > 2)y4dW +1 083305-083919 204 hypothetical 22.9 kd protein y4dX +1083944-84522 192 hypothetical 20.6 kd protein ydeA −2 084570-084836 88hypothetical 9.9 kd protein ydeB −2 084976-085290 104 hypothetical 11.6kd protein fe1 − 085829-088007 MerA 474 X65467 put. fragments; homologyto mercuric reductase, put. frameshifts: 86592 (−1<−3), 87288 (−3<−2)y4eC −2 088305-089228 307 14−306 TraC-1 1061 X59793 38 55 hyp. 34.2 kdprotein; hom. to 5′ end. of traC-1 from plasmid RP4 y4eD +1091051-092178 375 51-136 ORF145 145 X52594 29 55 put. phosphodiesterase;low homology to glycerophosphoryl- diester-phosphodiesterase y4eE +1092212-093288 358 hyp. 38.5 kd protein fe2 − 093572-093969 TrpA U14952(fragments of put. transposase; put. frameshift; 93798 (2>3) y4eF −1093980-094735 251 2-236 Int 259 U14952 37 53 put. integrase/recombinase1-251 Y4qK 308 this work 92 94 (”phage-type”); similar to Y4rF (35%aa-id.); low similarity to Y4rABCDE fe5 −1 094988-095188 66 1-66 Fq6 66this work 79 94 put. defective integrase/recombinase 1-66 Y4rC 332 thiswork 41 55 fe3 − 095343-096025 Int 259 U14952 fragments hom. tointegrase; put. frameshift: 95559-95671 (−2<−1) y4eH nolL −2096093-097193 366 11-359 NolL 373 U22899 63 77 nodulation protein; hyp.acetyl transferase y4eI −2 097914-098225 103 ! hyp. 11.1 kd protein withtransmembrane domain fe6 +3 098358-098657 99 3-98 AatB 410 L12149 40 55hyp. 10.3 kd protein fragment, hom. to C-terminal part of bacterialaminotransferases y4eK +2 098675-099421 248 10-245 Adh 252 U00084 37 53hyp. 
short chain type dehydrogenase/reductase y4eL +3 099447-100193 2481-244 Gno 256 X80019 31 47 hyp. short chain type dehydrogenase/reductasefe4 + 100270-101901 IlvG M37337 put. fragment; put. frameshifts: 100721(1 < 2), 101728 (2 > 1) fe7 −1 101585-102298 237 1-103 Tnp 398 U08627 9195 put. truncated transposase-like protein; similar to Y4pO y4eN −3102625-102936 103 hyp. 11.5 kd protein y4eO −2 102933-103598 221 hyp.24.5 kd protein y4fA −1 103805-106342 845 327-837 MepA 657 X66502 41 59prob. methyl-accepting 7-845 Y4sI 756 this work 29 49 chemotaxis proteiny4fB +3 106620-108614 664 hyp. 73.7 kd protein y4fC +3 109884-110618 24410-163 DszA 453 L37363 38 52 hyp. (fragmentous?) monooxygenase; extendedhomology to DszA in fr.2: 110372 to 110506. y4fD −1 110516-111178 220hyp. 24.6 kd integral membrane protein y4fE −2 111195-111677 160 hyp.17.2 kd protein precurser y4fF −1 111803-112348 181 hyp. 19.5 kd proteiny4fG −2 112338-112727 129 hyp. 14.5 kd protein y4fH −1 113474-113782 102hyp. 11.6 kd protein ff1 −3 113779-114114 111 61-97 DppF 330 L08399 5686 hyp. protein fragment, similar to central region of oligo/di-peptideABC transporter ATP-binding proteins y4fJ −2 114348-115379 343 3−210RopA 318 M69214 53 66 put. outer membrane protein (point) precurser y4fK−2 116112-117395 427 275-421 XyIS2 157 L02642 31 53 put. transcriptionalregulator (AraC family) y4fL −3 117385-118212 275 9−243 ORF 268 U3905932 46 hyp. 29.1 kd integral membrane protein, belongs to the inositolmonophosphatase family y4fM −2 118209-119144 311 hyp. 35.5 kd proteiny4fN −2 119145-120854 569 11-513 CysU 550 U32807 23 45 prob. ABCtransporter permease protein; put. part of binding- -protein-dependenttransport system Y4fNOP y4fO −1 120851-121870 339 12-247 PotA 381 U3275949 68 prob. ABC transporter AtP- binding protein y4fP −1 121883-122959358 32−293 SufA 338 M33815 23 42 prob. ABC transporter periplasmicbinding protein precurser y4fQ +1 123016-124194 392 9−234 NagC 406X14135 25 46 hyp. 
41.6 kd protein; belongs to “ROK” family(transcriptiotial regulator or transferase) y4fR +1 124813-126453 54688-539 JpaH 532 M32063 38 54 hyp. 60.5 kd protein, hom. to invasionplasmid antigen H y4gA −1 126806-127369 187 hyp. 20.9 kd protein; lowsimilarity to Y4rE y4gB −2 127485-127904 139 hyp. 16.1 kd protein y4gC−1 127901-128479 192 1-178 ORF2 415 L34580 43 58 put.integrase/recombinase (“phage-type) y4gD −1 128579-128857 92 hyp. 10.5kd protein y4gE +2 131021-131767 248 hyp. (fragmentous?) 27.7 kdprotein; put. frameshifts: 131532 (2>1), 131892 (1>2) y4gF +2132734-133786 350 4-345 RhsB 353 U51197 65 74 prob. dTDP-D-glucose-4,6-dehydratase (Y4gFGH inv. in dTDP-L-rhamnose biosynthesis) y4gG +2133790-134680 296 1-290 RhsD 288 U51197 48 66 prob.dTDP-4-dehydrorhamnose reductase y4gH +1 134677-135537 286 2-285 RfbA293 U09876 65 82 prob. glucose-1-phosphate thymidylyltransferase y4gI +3135534-138263 909 276-894 RfbC 1275 U36795 38 55 hyp. 102.8 kd protein(homolog is involved in O-antigen biosynthesis) y4gJ −1 138737-139315192 hyp. 21.1 kd protein y4gK fixF +3 142026-143234 402 114-184 KpsS 389X74567 26 54 necessary for functional nitrogen 203-362 30 53 fixation,hom. to capsule polysaccharide export protein y4gL −3 143473-144060 19524-192 RhsC 188 U51197 53 65 prob. dTDP-4-dehydrohamnose- 3,5-epimerase(inv. in dTDP -L-rhamnose biosynthesis) y4gM −2 144147-145907 586 26-581MsbA 582 Z11796 32 56 prob. ABC transporter ATP-binding protein y4gN +2146075-147226 383 52-297 VirA 304 L08012 29 46 hyp. 45 kd protein y4hA−1 147455-148558 367 7−362 ChaA 366 L28709 34 58 put. ionic transportery4hB noeE −3 148819-150078 419 3-138 F42G9.8 359 U00051 32 49 nodulationprotein (put. 197−289 25 50 sulfate transferase) y4hC noeG −3151051-151782 243 18−229 u0002kb 243 U00024 27 42 nodulation protein(unknown function) y4hD nolO −1 151979-154021 680 1-126 NolN 127 L2275670 83 inv. in O-carbamoylation of 140-496 NolO 358 78 89 Nod factors(sim. 
to NodU) y4hE zodJ −3 154120-154908 262 5−261 NodJ 262 J03685 6984 prob. ABC transporter permease (see nodI) y4hF nod1 −3 154912-155943343 15−343 NodI 339 X55795 69 85 prob. ABC transporter ATP-bindingtransport protein; put role; together with NodJ export of modifiedbeta-1,4 N-glucosamine oligosaccharides y4hG nodC −1 156095-157336 4131-413 NodC 413 X73362 99 100 N-acetylglucosaminyltransferase y4hH nodB−3 157351-157998 215 1-215 NodB 214 X73362 99 99 chitooligosaccharidedeacytelase v4hI nodA −2 579951-158585 196 1-196 NodA 196 X73362 100 100N-acyltransferase; nodABC involved in synthesis of backbone of modifledN-acylated glucosamine oligosaccharides y4hJ −1 158993-159775 260 59-240ORF2 251 L133618 68 81 hom. to part of coproporphyrinogenIII oxidase(lacks C-terminus and conserved N-term. domain) y4hK +3 160722-161465247 hyp. 25.4 kd internal membrane protein y4hL +1 161569-161826 85 hyp.9.6 kd protein y4hM +1 163042-164253 403 53-169 Gfor 439 M97379 31 54hyp. 43.9 kd protein (partially hom. to glucose-fructose oxidoreductase)y4hN +2 164600-165034 144 10-144 ORFA 135 X84099 38 53 hyp. 16 kdprotein; partiaily hom. to Y4jB and Y4rG y4hO +1 165037-165384 115 1-115ORF140 140 X74068 100 100 hyp. 12.8 kd protein 1-115! ORFC 144 X84099 5469 1-115 Y4jC 117 this work 36 62 y4hP +1 165430-167088 552 1-215 no1265266 X74068 97 97 hyp. 61.7 kd protein; similar to 80−328 ORF2 258 M1020467 79 Y4aQ, Y4jD and Y4qI 162-492 ORF3 163 M10204 47 61 y4hQ +3167091-167675 194 5-185 ORF3 237 X51418 35 53 hyp. 21.7 kd protein 1-52ORF91 >91 X74068 96 98 y4hR −3 167710-167934 74 hyp. 8.8 kd protein fi1−1 168208-168300 hyp. transposase fragment similar to R. melilotiISRm2011-2 fi2 +1 168430-168792 120 1-130 Y4iO 252 this work 78 87 put.defective transposase (homologous 1-108 Y4rJ 396 this work 74 87 toN-terminal parts of Y4iO and Y4rJ) fi3 +2 168798-169190 130 1-109 ORF1A317 M33159 37 55 put. 
defective transposase(hom.to C- 1-130 Y4iO 252this work 78 87 terminal parts of Y4iO and Y4rJ); 1-130 Y4rJ 396 thiswork 76 84 additionally weak homology to Y4pF/Y4sB and Y4qE (<30%identity) y4iR −3 169231-169716 161 15-145 PsiB 134 L26581 55 74 hyp.protein (homolog located in a poly- saccharide biosynthesis inhibitionoperon y4iC −2 169929-170621 230 58-123 ORF 161 Z73419 41 54 hyp. 25.8kd protein (ORF=MTCY373.06) y4iD −3 170563-172551 662 137-342 ORF 495Z73101 40 59 prob. monooxygenase (ORF=MTCY31.20) 418-605 28 51 y4iE +3173295-173702 135 1-135 Y4rL 155 this work 33 52 hyp. 15.4 kd(fragmentous?) protein; similar.to Y4ZA y4iF −3 174211-175128 305 hyp.34.1 kd protein y4iG −2 175590-175862 90 1-73 Y4aT 266 this work 93 97hyp. 10.5 kd (fragmentous?) protein 1-73 Y4bF 457 this work 60 76 y4iH+2 176045-176764 239 1−236 Y4jT 336 this work 32 53 hyp. 26 kd proteinprecurser y4iI −2 176937-179048 703 hyp. 76.2 kd integral membraneprotein y4iJ −2 179097-180887 596 hyp. 65.5 kd protein; low similarityto Y4iM y4iK −3 180940-181638 232 hyp. 26.8 kd protein; y4iKL: twofragments of one gene?; put. frameshift: 181884 (−3<−2) y4iL −2181692-182990 432 hyp. 47.8 kd protein; y4iKL two fragments of onegene?; put. frameshift: 181884 (−3<−2) y4iM −2 183036-184334 432 hyp.47.1 kd protein; low similarity to Y4iJ; y4iMN two fragments of onegene?; put. frameshift: 184440 (−2<−3) y4iN −3 184309-184935 208 hyp.22.1 kd protein precurser; y4iMN two fragments of one gene?; put.frameshift: 184440 (−2<−3) y4iO −2 185679-186437 252 17−243 Tnp 334Z48244 29 46 put. transposase or transposase- 1-121 Fi2 120 this work 6779 fragment; additionally 123-252 Fi3 130 this work 78 87 weak homologyto Y4pF/ 1-252 Y4rJ 396 this work 71 83 /Y4sB and Y4qE (<30% identity)y4iP −1 186437-186832 131 4-163 Y4rJ 396 this work 58 80 hyp. 14.4 kdprotein or fragment hom. to N-term. of Y4rJ y4iQ −3 187162-188058 29813−253 IstB 265 U38187 34 56 identical to Y4nD/Y4sD; put. 
inser- 8−283Y4bM 263 this work 31 56 tion sequence ATP-binding protein; 5−265 Y4uH248 this work 31 52 similarity to Y4bM/Y4kI/Y4tA, Y4uH and weakly toY4PL y4jA −2 188055-189569 504 147-494 IstA 507 U38187 25 42 identicalto y4nE/y4sE; hyp. 57.2 kd 395-504 Fz4 110 this work 72 85 protein withlow similarity to IS21/IS408/IS1162 transposases y4jB +3 190248-190706152 24-79 ORF1 130 U19148 46 69 hyp. 16.7 kd protein; partiallysimilarity Y4hN; low similarity to Y4rG y4jC +2 190703-191056 117 1-115ORFC 144 X84099 39 58 hyp. 13.1 kd protein; see y4hO 1-117 Y4hO 115 thiswork 36 62 y4jD +2 191105-192640 511 89-298 ORF2 258 M10204 36 53 hyp.56.7 kd protein: see y4hP 340-453 ORF3 163 M10204 28 49 18-183 no1265266 X74068 32 48 y4jE +1 192637-193458 273 hypothetical (fragmentous?)29.4 kd integral membrane protein; put. frameshift: 192996 (1>2; end ofshifted ORF at 193183) y4jF −1 194771-196330 519 hyp. 55.4 kd integralmembrane protein y4jG −3 196333-196821 162 hyp. 17.9 kd transmembraneprotein y4jH −2 196818-197435 205 hyp. 23 kd protein y4jI −3197428-197820 130 hyp. 13.6 kd protein y41J +1 198043-198300 85 1-85StbC 103 L48985 67 76 put. plasmid stability protein y4jK +3198297-198719 140 1-138 StbB 139 L48985 57 76 put. plasmid stabiiityprotein y4jL +3 199002-199664 220 hyp. 25.1 kd protein y4jM −2199746-199958 70 1-58 Y4bF 457 this work 75 79 hyp. 8 kd protein orprotein fragment 15-58 fb1 188 this work 50 64 y4iN −3 199975-200415 146hyp. 16.3 kd protein y4jO −3 201514-202479 321 hyp. 36.1 kd protein;y4jOP: two fragments of one gene?, put. frameshift: 202550 (−3<-1) y4jP−1 202406-203194 262 hyp. 29.5 kd protein; y4jOP: two fragments of onegene?, put. frameshift: 202550 (−3<−1) y4iQ +2 203729-206848 1039 hyp.115.9 kd protein y4iR +1 206860-207315 151 hyp. 17.3 kd protein y4iS +1207316−208557 413 hyp. 44.8 kd protein y4iT −1 208877-209887 336 17−283Y4iH 239 this work 32 53 hyp. 36.4 kd protein precurser y4kA −3209917-210885 322 hyp. 
36.7 kd protein y4kB +1 211663-212088 141 hyp.15.2 kd integral membrane protein fk2 −1 212111-212479 122 58-116 ORFl4104 X00493 59 76 hyp. fragment; sim. to Y4hP, Y4jD and Y4qI; additionalhomology to ORF14 in fr. +3/+2: 212331-212509 y4kD −1 212750-214399 549hyp. 60.4 kd protein y4kE −1 214412-215455 347 hyp. 38 kd protein;y4kEF: two fragments of one gene?, put. frameshift: 215616 (−1<−2) y4kF−2 215439-216743 434 hyp. 47.4 kd protein; y4kEF: two fragments of onegene?, put. frameshift: 215616 (−1<−2) y4kG −2 216855-217064 69 hyp. 7.7kd protein y4kH −3 217105-217488 127 hyp. 14.1 kd protein y4kI −1217670-218461 263 — — — — — — see y4bM y4kJ −3 218458-220008 516 — — — —— — see y4bL y4kK −1 220103-221041 312 hyp. 34.9 kd protein y4kL −2221049-222041 330 101-296 ORF300 300 U23723 39 56 hyp. 37.6 kdAAA-family ATPase protein y4kM +2 222641−222994 117 hyp. 13.1 kd proteiny4kN +2 223115-223537 140 hyp. 15.7 kd protein y4kO +2 223970-224218 82hyp. 9.2 kd protein y4kP +1 224215-224505 96 hyp. 11 kd protein y4kQ −2224898-225326 142 hyp. (fragmentous?) 15.3 kd protein; homology to hipOfragments on the complementary strand fk1 +3 225094-225473 Z36940fragments hom. to hipO y4kR −3 225535-225666 43 1-36 ORF6 347 M87280 5566 hyp. 4.8 kd (fragmentous?) protein (smallest ORF predicted to be aprotein); hom. to N-term. of protein in crtE-crtX intergenic region y4kS−3 225751-226656 301 1-301 ORF8 300 U12678 93 94 hyp. 33.2 kd proteiny4kT −2 226653-228203 516 1-516 ORF7 516 U12678 93 94 hyp. 55.1 kdprotein y4kU −3 228514-229512 332 1-332 ORF6 332 U12678 90 94 prob.geranyltranstransferase y4kV −3 229666-231009 447 92-447 CYP117 356U12678 89 94 cytochrome P450 BJ-4 homolog y4lA −2 231009-231845 2781-274 ORF4 275 U12678 83 87 short-chain type dehydrogenase/reductaseyrlB −3 231832-232140 102 1-58 ORF3 94 U12678 93 98 put. 
P450-system3Fe-3S ferredoxin y4lC −2 232170-233573 467 48-428 CYP114 382 U12678 9093 cytochrome P-450 BJ-3 homolog y4lD −1 233666-234868 400 3-400 CYP112401 U12678 92 95 cytochrome P-450 BJ-1 homolog fl3 −2 235704-235904 662-54 ORF8 >207 X66124 60 71 hyp. 7.6 kd protein fragment, homology toORF8 fragments also upstream of fl3 up to 236048 fl1 − 236796-237416Z36981 homology to hupK/hupJ fragments (fr. −3/−2) y4lF +1 237508-238479323 hyp. 36.1 kd protein y4lG +2 238490-238975 161 hyp. 17.4 kd proteiny4lH −2 238959-239537 192 3-184 Fic 200 M28363 34 51 hyp. 22.4 kdprotein; hom. to cell filamentation/division protein y4lI −2239541-239750 69 hyp. 7.3 kd protein y4lJ −3 240358−240861 167 hyp. 18.1kd protein fl2 − 240920-241040 X65471 fragments of transposase (ISRm4)y4lK +1 241207-241605 132 hyp. 14.3 kd protein y4lL −2 241845-244328 827118-816 SLR0359 1244 D63999 33 50 hyp. 91.8 kd protein (member of E.coli YegE/YhdA/YhjK/YicC family) fl4 +1 244540-244851 103 19-103 TnpA990 L14931 39 51 put. truncated transposase; hom. to 28-81 F15 112 thiswork 94 98 N-term. of TnpA (transposon Tn163); strong similarity to toC-terminus of F15 Y4lN +3 244848-245330 160 hyp. 18.1 kd protein y4lO −3247156-247938 260 11-216 AvrRxv 373 L20423 36 50 hyp. 29.1 kd protein;hom. to avirulence protein; put, frameshift according to homolog:247230-247293 (−2<−3): end of shifted frame: 246960 fl5 +1 248290-248628112 59-112 F14 103 this work 94 98 hyp. protein fragment; strongsimilarity to part of F14 fl6 +3 248814-249680 288 8-286 Tnp 988 M9729727 49 put. fragmentous transposase; homologous C-term. of transposase(Tn1546) y4lR +3 249696-251264 522 hyp. 56.8 kd protein y4lS +1251407-251958 183 3-176 PaeR7IN 195 S78872 42 56 put.integrase/recombinase 4-181 Y4cG 305 this work 40 60 (“resolvase-type”)y4mA +3 251955-252380 141 hyp. 15.8 kd protein fm1 − 254694-254920fragments hom. to xylitol-dehydrogenase y4mB +3 255450-256139 229 59−229ORF4 212 X13583 33 53 hyp. 
24.6 kd outer membrane protein precurser y4mC+2 256811-257524 237 hyp. 26.2 kd protein precurser y4mD −1258065-258334 89 hyp. 10 kd protein y4mE −3 259030-260292 420 6-334 HipA440 M61242 32 46 hyp. 45.7 kd protein 2-417 Y4dM 409 this work 34 56y4mF −2 260289-260519 76 11-47 ORF3 90 X06090 37 70 hyp. transcriptionalregulator; very low similarity to phage repressor proteins y4mG +3261174−261395 73 hyp. 7.8 kd protein y4mH −2 261747−262640 297 hyp. 33.9kd protein y4mI −2 262698-263672 324 11-252 RbsB 296 M13169 25 49 prob.ABC transporter periplasmic binding protein precurser (transport systemY4mIJK probably transports a sugar) y4mJ −3 263716-264717 333 12-323RbsC 321 M13169 34 55 prob. ABC transporter permease y4mK −2264714-266207 497 8-489 RbsA 501 M13169 34 55 prob. ABC transporterATP-binding protein y4mL −3 266218-267477 419 1-418 HI1029 425 U00079 3358 put. permease (E. coli YiaN/YgiK family) y4mM −2 267474-269099 54138-360 HI1028 328 U32729 33 54 put. permease (SBR family 7) y4mN −1269096-270133 345 37−340 Tkt 655 U09256 36 54 hyp. transketolase familyprotein (fragmentous?); hom. to C-term. of transketolases y4mO −3270130-270969 279 9−270 Tkt 655 U09256 36 52 hyp. transketolase familyprotein (fragmentous?); hom. to N-term. of transketolases y4mP −3271000-271761 253 4-249 F09E10.3 255 U41749 41 60 put. short-chain typedehydrogenase/reductase y4mQ +1 271909-272805 298 1-289 PerR 297 U5708048 65 hyp. transcriptional regulator (LysR family) y4nA −2 273204-275384726 45-302 ORF 690 D14005 21 36 prob. peptidase; very low similarity to365-718 38 54 Y4qF and Y4sO (<25% identity) y4nB nodU −3 276451-278127558 1-558 NodU 558 X89965 100 100 inv. in 6-O-carbamoylation of Nodfactors; similar to Y4hD y4nC nodS −1 278144-278794 216 1-216 NodS 216J03686 100 100 methyltransferase inv. in Nod-factor synthesis y4nD −3280453-281349 298 — — — — — — see Y4iQ y4nE −2 281346-282860 504 — — — —— — see Y4jA fn1 + 283238-283467 241 M26938 hom. 
to virG fragments;similar to fq3 y4nF +3 283809-284501 230 hyp. 25.4 kd protein precurser;low similarity to Y4aO (<30% id.) fn2 − 284752-284923 X79443 fragmentshom. to ORF2 (IS-ATP-binding protein) from IS1162 y4nG +2 285407-286597396 53-365 ORF4 333 U08223 31 47 put. NAD-dep. nucleotide sugarepimerase(dehydrogenase y4nH +1 286594-286947 117 5-113 MvrC 110 M6273230 47 hyp. 12.3 kd integral membrane protein (some similarity toethidium bromide resistance proteins) y4nI +2 286964-287326 120 hyp. 13kd transmembrane protein y4nJ +1 287335-288852 505 80−266 BetA 548U39940 29 44 hyp. GMC-type oxidoreductase 343-468 32 45 y4nK −2288906-290894 662 hyp. integral membrane protein y4nL −3 290914-291984356 14-345 ORF6 328 U47057 26 45 put. NAD dep. nucleotide sugarepimerase/dehydrogenase y4nM −3 292003-293553 516 226-514 NoeC 307L18897 30 52 put. permease y4oA −3 294502-296283 593 328-494 MccB 350X57583 29 41 hyp. 65.2 kd protein; homolog inv. in production of the4-590 Y4qC 583 this work 30 50 translation inhibitor microcin C7 y4oB +1296572-296961 129 hyp. 14.7 kd protein y4oC +1 296965-297657 230 hyp. 26kd protein y4oD −1 297746-298390 214 hyp. 23.5 kd protein y4oE −3298939-299148 69 hyp. 7.4 kd protein fo1 −2 299145-299588 147 fo1 andfo2: two fragments of one put. gene; put. frameshift: 299664 (−2<−3) fo2−3 299578-299955 125 25-109 ORF11 344 X53264 37 63 homology to 5 part ofORF11; 1-123 Y4cM 325 this work 25 51 fo1 and fo2: two fragments of oneputative gene; put. frameshift: 299664 (−2<−3) fo3 +3 300015-300815 26715-252 Tnp 518 L09108 40 59 fo3 and fo7: transposase-like proteininterrupted by NGRIS-6 fo4 −2 300828-301259 143 1-143 Y4bA 694 this work77 83 hyp. fragment; f04/5/6: fragments of one gene similar to Y4bA/Y4pHfo5 −1 301274-301684 136 1-127 Y4bA 694 this work 83 94 hyp. fragment;f04/5/6: fragments of one gene fo6 −2 301608-302900 430 1-393 Y4bA 694this work 89 95 hyp. fragment; f04/5/6: fragments of one gene y4oL −3302890-303156 88 1-88 Y4bB 98 this work 63 69 hyp. 
9.6 kd protein y4oM-1 303179-303628 149 1-149 Y4bC 149 this work 79 88 hyp. 16.8 kd proteiny4oN −2 303810-304022 70 1-70 Y4bD 89 this work 73 84 hyp. 8.1 kdprotein fo7 +2 304118-304453 111 4-103 Tnp 518 L09108 40 59 fo3 and fo7:transposase-like protein interrupted by NGRIS-6 y4oP +1 304861-306156431 47-429 u1756v 469 U15180 27 42 prob. ABC transporter binding protein(Y4OPQRS: sugar- like transport system) y4oQ +2 306236-307165 309 31−301MalF 310 U15180 35 56 prob. ABC transporter permease protein y4oR +2307178-308011 277 12−277 MalG 296 U15180 30 52 prob. ABC transporterpermease protein y4oS +1 308008-309123 371 7−369 UgpC 369 U00039 50 68prob. ABC transporter ATP.binding protein y4oT −2 309132-309722 1962-196 Y4pA 609 this work 28 50 hyp. 20.6 kd protein; homologous toN-terminus of Y4PA, and weakly to Y4oV y4oU +1 309853−311061 402 hyp.43.1 kd protein precurser y4oV +2 311051-311908 285 3-280 Y4pA 609 thiswork 32 56 hyp. 30.2 kd protein; homologous to N-terminus of Y4PA, andweakly to Y4oT y4oW +1 311911-312561 216 hyp. 23.7 kd protein y4oX +3312606-313688 360 36-233 MocA 317 X78503 29 44 prob. NAD-dep.oxidoreductase y4pA +1 313714-315543 609 310-596 HydG 441 U00006 33 50put. transcriptional regulator (sigrra54-dep.) 6-290 Y4oV 285 this work32 56 35-237 Y4oT 196 this work 28 50 y4pB otsB +3 316350-317147 26530-260 OtsB 266 X69160 41 57 prob. trehalose-phosphate phosphatase y4pCotsA +1 317185-318579 464 1-456 OtsA 474 X69160 46 66 prob.trehalose-6-phosphate synthase; similar to fq1/2 fp1 + 318915-319242U08864 fragments homologous to ORF3; put. frameshift acc. to homologue:319122 (3>1) fp2 + 319236-319670 U08864 fragment homologous to ORF1 fromIS1248 (fr. 3); similar to fs4 Y4pD −1 319601-320116 171 13-140 Ros 142M65201 50 71 put. transcriptional regulator (MucR family); missing Znfinger motif; similar to Y4ap y4pE −1 320606-321013 135 1-135 222 U1876491 94 identical to y4sA; hyp. 15.5 kd protein hom. to N-term. 
of RFRS92SkDa protein y4pF −2 321297-322460 387 50−374 Tnp 334 Z48244 43 60identical to y4sB; put. transposase; low similarity to Y4qE, Y4iB andY4jO (<30% aa4d.) y4pG −3 322486-323064 192 1-191! ORFA 197 U22323 47 64identical to y4sC; hyp. 21.1 kd protein fp3 +2 323189-323956 X79443“ORF” homologous to ORF1 of ISI162 interrupted by stop codon (323444)y4pH −1 323969-326053 694 — — — — — — see y4bA y4pI −2 326043-326309 88— — — — — — see y4bB y4pJ −3 326329-326778 149 — — — — — — see y4bC y4pK−1 326969-327238 89 — — — — — — see y4bD fp4 +1 327277-328059 LO9108 4865 fragment homologous to put. IS-ATP-binding protein y4pL +3328071-328808 245 1-204 ORF2 231 X79443 51 63 put. insenion sequenceATP-binding protein: similarity to 1-242 Y4bM 263 this work 55 73Y4bM/Y4kI/Y4tA, Y4uH, and weakly to 1-245 Y4uH 248 this work 61 77Y4iQ/Y4nD/Y4sD (<30 aa-id.) y4pM +2 329159-329977 272 hyp. 30.9 kdprotein fp5 − 330657-331414 put. frameshift: 331032 (2<1) y4pN syrM1 −3332506-333522 338 13-324 SyrM 326 M33495 63 77 probable symbioticregulator (LysR family) 1-338 SyrM2 339 thiswork 62 79 y4pO +1335062-336264 400 1-400 Tnp 400 M60971 96 98 prob. transposase (Mutaterfamily); similarity to fe7 fq2 −2 333987-335003 338 1-320 OtsA 474X69160 44 61 join fq1 + fq2: hom. to trehalose-6-phosphate synthaseinterupted by ISRm3-like element NGRIS-8; similarity to Y4pC (45%aa-id.) fq1 −1 336311-336694 128 44-174 OtsA 474 X69160 48 67 see fq2fq3 + 337338-338056 M26938 virG homologous fragments: stop at 37380;put. frameshift at 337844 (3>2); similar to fn1 y4qB −1 339053-339547164 hyp. 18.8 kd protein y4qC −3 339535-341286 583 314-489 ORF 401Z54354 28 46 hyp. 63.6 kd protein 1-583 Y4oA 593 this work 30 50 y4qD −3343216-343950 244 1-244 Y4oA 618 this work 55 74 hyp. 26.8 kd protein,similar to N-terminus cf Y4rO y4qE +2 344114-345286 390 37-380 Tnp 364X77623 38 57 prob. transposase; low similarity to Y4pF/Y4sB, Y4iB, Y4iOand Y4rJ (<30% aa-id.) 
fq4 +3 345798−346130 M38257 34 51 fragmentshomologous to XerC (integrase) y4qF −2 346215-348479 754 41-725 PtrII707 D10976 31 49 prob. peptidase (S9A family); high similarity to Y4sO;32-736 Y4sO 705 this work 70 84 low similarity to Y4nA (<25% id.) y4qG−2 348501-349847 448 40-389 YgjG 454 U32722 42 62 prob. aminotransferase(class 3) y4qH −1 350294-351274 326 144-326 LasR 239 M59425 37 51 hyp.transcriptional regulator (LuxR family) y4qI −2 351837-353456 539146-419 ORF1 322 M25805 44 63 hyp. 59.7 kd protein; similar to Y4aQ,Y4hP, Y4iD fq5 −3 353533−353775 fragments fq5 and fr3 represent one put.gene similar to Y4hO and Y4jC interrupted by IS elements y4qJ −1354140-355336 398 7-395 TnpA 388 U14952 42 60 put. transposase y4qK −2355344-356270 308 51-293 Int 259 U14952 39 55 put. integrase/recombinase(“phage-type”); similar to 51-308 Y4eF 251 this work 92 94 Y4rF; lowsimilarity to Y4rABCDE fq6 −2 356436-356636 66 1-66 Fe5 66 this work 7994 put. defective integrase/recombinase (“phage-type”); 75% Y4rC 332this work 45 62 nt-identity: 356436-356710 and 94988-95262 R[20] y4rA +1356803-358032 409 17-397 ORF2 415 L34580 39 55 put.integrase/recombinase (“phage-type”) y4rB +3 358029-358973 314 135-267TnpI 284 X07651 30 51 put. integrase/recombinase (“phage-type”) y4rC +2358970-359968 332 22-294 XerC 295 U32696 31 50 put.integrase/recombinase (“phage-type”) 267-332 Fe5 66 this work 41 55267-332 Fq6 66 this work 45 62 y4rD −3 360025-360870 281 15-277 XprB 298M54884 25 46 put. integrase/recombinase (“phage-type”) y4rE −2360867-361799 310 50-288 YqkM 296 D84432 27 48 put.integrase/recombinase (“phage-type”); low similarity to Y4gA y4rF −1361796-363073 425 126-414 ORF2 415 L34580 34 49 put.integrase/recombinase (“phage-type”) y4rG −1 363287-363694 135 16-109ORF1 130 U19148 32 48 hyp. 14.8 kd protein (IS866 family); lowsimilarity to Y4jB, Y4hN y4rH −3 363895-365331 478 62-374 Bcp 598 X6347026 44 put. ligase; hom. 
to biotin carboxylases fr1 −3 366307-366669 85%aa-identity to part of Y4rL fr2 − 366594-367402 put. frameshift: 367296(−2<−1) fr3 −3 367705-367827 hom. to N-term. of Y4hO; see fq5 y4rI −3368503-369675 390 hyp. 44 kd protein y4rJ +1 369697-370887 396 152−379Tnp 339 M80806 28 45 put. transposase; low similarity to Y4qE (<30%aa-id.) 135−244 Y4iA 120 this work 74 87 266−396 Y4iB 130 this work 7684 135−396 Y4iO 252 this work 71 83 2-131 Y4iP 131 this work 58 80 y4rK−1 370976-371350 124 hyp. 14.5 kd protein y4rL −2 371454-371921 155 1-99Y4zA 295 this work 99 99 hyp. 17.7 kd protein; y4rLM: two fragments ofone 17-155 Y4iE 135 this work 33 52 gene?; put. frameshift: 371972(−2<−3); 85-99% aa- identity to parts of Y4ZA and fr1 y4rM −3371938-372990 350 258-339 Y4zA 295 this work 98 98 hyp. 39.4 kd protein;see y4rL y4rN −2 373578-374795 405 35-368 P43 416 X57470 26 44 hyp. 41.6kd integral membrane protein y4rO +1 375313-377169 618 274-596 HIN0578366 U32742 25 45 hyp. 69.3 kd protein; N-terminus: hom. to Y4qD; C-1-244! Y4qD 244 this work 55 74 terminus: hom. to C-terminus ofhistidinol-1-phosphate transaminase fr4 + 377185-377534 X66016 sim. toY4rG; put. frameshift: 377376 (1>3); hom. to fragment of ORFA3(377409-377540) y4sA −3 377842-378249 135 — — — — — — see y4pE y4sB −1378533-379696 387 — — — — — — see y4pF y4sC −2 379722−380300 192 — — — —— — see y4pG y4sD −1 380933-381829 298 — — — — — — see y4iQ y4sE −3381826-383340 504 — — — — — — see y4jA fs5 −3 383593-384054 153 8-150Tnp 334 Z48244 48 65 put. defective transposase; sim. to fs1384210-384493 fragments with 94-84% nt-id. to ISRm6 (R. meliloti; acc.no. X95567) y4sG +1 384808-385818 336 97-325 Dd1 306 M14029 34 57 hom.to D-alanine:D-alanine ligase; probably different function y4sH +3386505-387890 461 267-337 CapA 411 M24150 42 63 hom. to encapsulationprotein A; nearly identical to Y4uA fs1 − 388138-388586 Tnp Z48244fragments of put. transposase; put. frameshift: 388452 (−3<−2); sim. 
toY4pF, Y4sB, fs5 fs2 +2 388697-388897 ORF1 U19148 43 62 put. transposasefragment; hom. to N-term. of ORF1; sim. to Y4jB, Y4rG, Y4hN fs3 +388966-390695 AtoC U17902 put. transcriptional regulator fragment (put.frameshifts: 389891 (1>2); 390170 (2>3)); sim. to Y4pA, Y4oV, Y4oT) y4sI+2 390971-393241 756 325-741 McpA 657 X66502 41 60 prob.methyl-accepting chemotaxis protein 1-749 Y4fA 845 this work 29 49 y4sJgapD −3 393202-394677 491 29-489 GabD 482 M88334 58 75 prob.succinate-semialdehyde dehydrogenase y4sK −1 394790-395170 126 5-122C23G10.2 185 U39851 55 71 bel. to the YER057C/YIL051C/YJGF family;probably important cellular function y4sL −1 395204-395815 203 2-203DadA 432 L02948 57 74 either functional dehydrogenase or non-functionalfragment; hom. to small subunit of D-aminoacid dehydrogenase y4sM +1395935-396318 127 1-127 ORF1 127 X74314 99 99 put. transcriptionalregulator (AsnC/Lrp family; low homology to y4tD); missing H-T-H regiony4sN +1 396523-396900 125 1-123 ORF2 >123 X74314 98 98 similar to ORFsderived from insertion elements (IS6501 family); low similarity to fu4fs4 + 396855-397283 (143 8-141 ORF1 186 X53945 48 63 put. IS-derivedprotein fragment (homology to C-term. of 1-141 Fp2 145 this work 39 62ORF1 from IS869) y4sO −2 397608-399725 705 10-694 PtrII 706 D10976 32 49prob. peptidase (S9A family); low similarily to Y4nA 1-705 Y4qF 754 thiswork 70 84 (<25% id.) ft1 +3 400377-400625 (83) 20-83 Y4tE 300 this work64 78 ft1 and ft2: one put. gene encoding an amino acid ABC transponterbinding protein interrupted by NGRIS-3c y4tA −3 400732-401523 263 — — —— — — see y4bM y4tB −2 401520-403070 516 — — — — — — see y4bL ft2 +1403249-403899 (216) 5-195 ArgT 260 V01368 25 48 see ft1 2−215 Y4tE 300this work 76 86 y4tD +1 404182-404691 169 11-161 HIN1362 168 U32817 3864 put. 
transcriptional regulator (AsnC/Lrp family; but low homology toy4sM) y4tE +1 405157-406059 300 31-281 FliY 257 U32734 27 48 prob.aminoacid ABC transporter binding protein 86-299 Ft2 215 this work 76 86(periplasmic); prob. part of binding-protein- dep. transport systemY4tEFGH y4tF +1 406111-406827 238 25-233 YckJ 234 X77636 35 54 prob.aminoacid ABC transporter permease protein Y4tG +3 406830-407525 2311-220 GlnP 226 D30762 32 54 prob. aminoacid ABC transporter permeaseprotein y4tH +2 407522-408295 257 5-256 GlnQ 242 M61017 52 71 prob.amino acid ABC transporter ATP-binding protein y4tI +1 408745-409953 40222-391 Slr0072 393 D64004 35 54 put. peptidase (M40 family) y4tJ +1409990-410988 332 7-328 Thd2 329 M21312 35 57 put. threonine dehydratasey4tK +3 410988-411983 331 69-326 ArcB 351 U39262 30 44 hyp.cyclodeaminase; (sim. to ornithine cyclodeaminase) y4tL +2 412118-413290390 10-384 ORF 411 D14463 27 45 hyp. hydrolase/peptidase (M24 family)1-389 Y4tM 392 this work 34 53 y4tM +2 413453-414631 392 17-390 PepQ 368Z34896 24 43 put. hydrolase/peptidase (M24 family) 1-390 Y4tL 390 thiswork 34 53 y4tN +1 414655-415179 174 hyp. 19.6 kd protein y4tO +1415252-416847 531 1-484 OppA 543 M60918 28 46 prob. peptide ABCtransporter binding protein precurser; prob. part of abinding-protein-dependent transport system Y4tOPQRS y4tP +2416852-417793 313 4-313 DPPB 339 L08399 36 58 prob. peptide ABCtransporter permease protein y4tQ +1 417796-418671 291 9-287 AppC 303U20909 36 56 prob. peptide ABC transporter permease protein; 418611: Cor T possible! y4tR +2 418673-419680 335 12-327 OppD 336 X56347 50 68prob. peptide ABC transporter ATP-binding protein y4tS +1 419677-420738353 3-320 AppF 329 U20909 49 69 prob. peptide ABC transporterATP-binding protein y4uA +3 420774-422159 461 267-337 CapA 411 M24150 4263 put. 
cell wall compound biosynthesis protein; almost identical toY4sH y4uB +3 422628-424031 467 1-464 BioA 448 U51868 33 57 prob.aminotransferase (class 3) y4uC +3 424056-425594 512 58-509 GabD 482M88334 33 52 prob. aldehyde dehydrogenase fu1 +2 425699-425779 N15K 238D45911 put. protein fragment; 67% id. to N15K in 26 aa fu2 +3425841-426083 PhbA 393 U17226 fragment 65% identical to C-term. ofbeta-keto-thiolase y4uD +1 426010-426507 165 hyp. 18.7 kd protein y4uE−3 426949-428028 359 78-290 Tnp 414 X15942 31 45 put. transposase (IS110family); put. frameshift: between 427040 and 427180 (−2<−3; end ofshifted ORF: 426699) y4uF +3 428292-429623 443 13-440 GLUD1 558 X0767442 60 prob. glutamate dehydrogenase fu3 + 429860-430007 Tnp 398 U08627put. transposase fragment (92% id. in 16 aa); 85% nt- identity to3′term. part of ISRm5 y4uG +1 430105-430320 71 hyp. 7.8 kd protein y4uH−1 430538-431284 248 1-202 ORF2 231 X79443 48 63 put insertion sequenceATP-binding protein; similarity to 1-245 Y4pL 245 this work 61 77 Y4pL,Y4bM/Y4kI/Y4tA and Y4iQ/Y4nD/Y4sD 1-248 Y4bM 263 this work 48 68(IS21/IS1162 family) 4-248 Y4iQ 298 this work 31 52 y4uI −3431296-432840 514 1-514 Tnp 518 L09108 44 63 put. transposase;similarity to Y4bL/Y4kJ/Y4tB (IS21/IS1162 family) fu4 − 433222-433560Tnp 201 X65471 put. transposase fragments (74-92% id. in 88 aa); 79% nt-identity to 5′term. of ISRm4 y4uJ fixU −1 433880-434110 76 1-70 FixU 70X51963 63 80 hyp. 8.5 kd ptotein y4uK nifZ −3 434107-434433 108 6-79ORF2 >78 X07567 52 78 put. nitrogen fixation Nifz protein y4uL fdxN −2434517-434711 64 1-64 FdxN 64 M21841 79 84 prob. 4Fe-4S ferredoxin y4uMnifB −1 434753-436234 493 1-493 NifB 490 M15544 72 81 involved in FeMocofactor biosynthesis y4uN nifA −1 436460-438244 594 37-594 NifA 584U31630 62 74 positive regulator of nif, fix, and additional genes(sigm54-dep.) 4yuO fixX −2 438297-438590 97 2-97 FixX 98 M15546 84 89prob. 3Fe-35 ferredoxin inv. 
in nitrogen fixadon y4uP fixC −1438605-439912 435 1-435 FixC 435 M15546 82 89 required for nitrogenaseactivity y4vA fixB −2 439923-441032 369 18−363 FixB 353 M15546 79 87putatively inv. in a redox process in nitrogen fixation y4vB fixA −2441042-441899 285 1-280 FixA 292 M15546 75 90 putatively inv. in a redoxprocess in nitrogen fixation fv1 −1 442181-442252 Nifs 384 X68444 put.NifS fragment (70% idendtity in 24 aa) y4vC −1 442316-442636 106 1-106ORF118 118 X13691 54 72 hyp. 11 kd protein (HesB/YadR/YfhF family);homologues located upstream of nifS y4vD −2 443313-443879 188 5-173HIN1693 241 U32848 46 60 put. redox enzyme (hom. to glutaredoxin-likemembrane protein and peroxysomat membrane proteins) y4vE nifQ +1444337-445029 230 56-212 NifQ 180 M26323 39 56 putatively involved in Mocofactor processing y4vF dctA1 +2 445088-446602 504 1-443 DctA1 456S38912 99 99 C₄-dicarboxylate transport protein; nt-deletion at 446416in comparison to sequence of acc. no. S38912 causing a frameshift (DctA1is 48 aa longer than DctA1 in S38912) y4vG +1 446599-447843 414 1-3413CamC 415 M12546 34 50 prob. cytochrome P450 y4vH +1 447844-448500 218(32-157 LinA 155 D90355 28 46) hyp. 24.6 kd protein (with very weakhomology to gamma-hexachlorocyclohexane-dechlorinase) y4vI +3448557-450203 548 9−250 FabG 244 U39441 38 56 short-chain typedehydrogenase/reductase 276-513 30 48 y4vJ +2 450341-451396 351 1-188LuxA 357 M36597 27 47 put. 
monooxygenase; similar to Y4wF; y4vK nifH1 +1451993-452883 296 1-296 NifH 296 M26961 99 99 Fe protein of nitrogenasey4vL nifD1 +1 452980-454494 504 199-393 NifD >195 M26962 98 99alpha-subunit of MoFe protein of nitrogenase y4vM nifK1 +3 454590-456131513 132-195 NifK >64 M26963 100 100 beta-subunit of MoFe protein ofnitrogenase y4vN nifE +1 456187-457677 496 1-469 NifE 547 X56894 62 78involved in FeMo cofactor biosynthesis y4vO nifN +1 457687-459096 4691-455 NifN 441 M18272 70 81 involved in FeMo cofactor biosynthesis y4vPnifX +3 459093-459575 160 22-156 NifX 159 X17433 52 68 nitrogen fixationprotein y4vQ +3 459579-460067 162 22-162 ORF4 156 X17433 49 70 hyp. 17.7kd protein, similar to proteins of other 1-162 Y4xD 162 this work 61 75nitrogen-fixing bacteria and to Y4XD y4vR +1 460501-460920 139 1-58 NifH296 M26961 50 63 similar to N-term. of Fe protein of nitrogenase y4vSfdxB +2 461228-461545 105 1-88 ORF5 102 M26323 52 65 prob. 4Fe-4Sferredoxin y4wA +1 463201-464739 512 86-499 PqqE 709 LA3135 50 70 hyp.zinc protease M16 family); sim. to Y4wB y4wB +3 464736-466079 447236-438 PqqF 213 LA3135 42 61 put. protease (lacks Zn-binding site; M16family); sim. to Y4wA y4wC +3 466590-467021 143 8-132 ORF3 127 L13845 4866 put. DNA-binding protein; high similarity to Y4aM 1-143 Y4aM 143 thiswork 69 77 y4wD +1 467758-468891 377 11-370 MosC 407 U23753 29 48permeasc-type protein; hom. to membrane protein from the rhizopinebiosynthesis (mosABC) gene cluster y4wE +3 469311-470417 368 20-361 His1356 D14440 32 53 prob. aminotransferase (class 2) y4wF +1 470824-471852342 40-194 LuxA 354 X06758 27 54 put. monooxygenase; sim. to Y4vJ y4wG+2 471890-472435 181 hyp. 19.4 kd protein y4wH +3 473343-473780 1451-145 ORF2 145 M19352 64 76 hyp. 15.6 kd protein y4wI −2 473928-475469513 hyp. 59 kd protein y4wJ −2 475503-475880 125 hyp. 
13.3 kd proteiny4wK nifW −1 476519-476971 150 12-118 NifW 108 M86823 50 63 NifW proteinhomolog; required for full activity of FeMo protein y4wL nifS −2477135-478298 387 4−387 NifS 402 M17349 58 73 prob. NifS protein (memberof class-5 pyridoxal- phosphate-dep. aminotransferase family) y4wM −2479145-481136 663 225-620 YejA >409 U00008 38 55 put. ABC transporterbinding protein (transporter or enzymatic function) fw1 −1 481460-481834124 1-116 DctA 441 M26531 55 61 hyp. truncated transporter-like protein;hom. to N-term. of DctA (see y4vF); two frameshifts acc. to homologue:481606 (−3<−1); 481530 (−2<−3; homology stops at 481419) y4wO −3481834-482154 106 hyp. 11 kd protein y4wP +2 482540-482947 135 hyp. 14.9kd protein y4xA nifH2 +1 483871-484761 296 1-296 NifH 296 M26961 99 99Fe protein of nitrogenase y4xB nafD2 +1 484858-486372 504 199−393NifD >195 M26962 98 99 alpha-subunit of MoFe protein of nitrogenase y4xCnifK2 +3 486468-488009 513 132-195 NifK >64 M26963 100 100 beta-subunitof MoFe protein of nitrogenase y4xD +3 488262-488750 162 22-162 ORF4 156X17433 47 73 hyp. 18 kd protein; similar to proteins of other nitrogen-2-162 Y4vQ 162 this work 61 75 fixing bacteria and to Y4vQ y4xE +1488773-488976 67 1-64 ORF1 69 X55450 40 67 hyp. 7.6 kd protein; similarto proteins of other nitrogen- fixing bacteria y4xF +3 488973-489149 58hyp. 6.5 kd protein y4xQ +2 489281-489583 100 14-83 ExoX 98 M61751 31 52put. exopolysaccharide production repressor (integral membrane protein)y4xG +2 490010-491527 505 hyp. 55.5 kd protein y4xH nodD2 −2491655-492593 312 1-312 NodD2 312 L38460 99 99 transcriptional regulator(LysR family); high similarity to 1-310 NodD1 322 this work 68 83 Y4aL,(NodD1) y4xI +2 494297-494977 226 1-224 PmrA 222 L13395 39 58 signaltransduction-type regulator y4xJ +1 495157-496428 423 76-378 GPIV 426J02451 27 46 hyp. protein hom. to proteins of the general secretionpathway (pulD family), sim. to Y4yD (NolW) y4xK +1 496438-497004 188hyp. 
20.6 kd protein precurser y4xL −1 497444-498460 338 hyp. 37.1 kdprotein y4xM −1 498719-499933 404 23-403 ORF1 408 X59939 22 49permease-type protein (YceE) y4xN −3 499930-501816 628 183-505 IucC 580X76100 28 43 hyp. 71 kd protein hom. to aerobactin synthetase subunity4xO −2 501816-502955 379 hyp. 40.9 kd protein y4xP −1 502952-503962 3365-304 CysK 308 D26185 40 60 put. cysteine synthase y4yA −1 503963-505336457 hyp. 49.9 kd protein; low similarity to diaminopimelate decrboxylasey4yB −3 505336-505800 154 hyp. 17.1 kd protein y4yC nolX −2505950-507740 596 1-596 NolX 596 L12251 98 99 nodulation protein as inR. fredii USDA257 y4yD nolW −3 508021-508725 234 1-234 NolW 234 L1225199 100 nodulation protein (PulD family); sim. to Y4xJ y4yE nolB +3508881-509375 164 1-164 NolB 164 L12251 98 99 nodulation protein y4yFnolT +3 509385-510254 289 1-289 NolT 289 L12251 96 97 nodulation proteinprecurser (YscJ homolog; M74011) y4yG nolU +2 510251-510889 212 1-212NolU 212 L12251 99 99 nodulation protein y4yH nolV +3 510891-511517 2081-60 ORF4 65 L12251 100 100 homologous to two (nodulation) proteins ofR.fredil 73-208 NolV 135 96 97 USDA257 (YscL homolog; M74011) y4yI hrcN+2 511514-512869 451 35-450 YscN 439 U00998 55 73 prob. ATPase involvedin secretion 1-80 HrcN 450 L12251 97 97 105-450 97 98 y4yJ +1512845-513381 178 1-178 ORF7 178 L12251 97 98 hyp. 20.4 kd protein v4yKhrcQ +1 513406-514482 358 171-350 YscQ 307 L25667 27 46 prob.translocation protein inv. in secretion processes 1-358 HrcQ 382 L1225196 98 (FliN/MopA/SpaO family) y4yL hrcR +2 514475-515143 222 6-216 YscR217 L25667 46 66 prob. translocation protein inv. in secretion processes1-222 HrcR 249 L12251 99 99 (FliP/MopC/SpaP family) y4yM hrcS +1515143-515418 91 1-66 YscS 88 L25667 34 65 prob. translocation proteininv. in secretion processes 1-91 HrcS 92 L12251 98 100 (FliQ/MopD/SpaQfamily) y4yN hrcT +3 515427-516245 272 28-250 YscT 261 L25667 31 52prob. translocation protein inv. 
in secretion processes 1-272 HrcT 272L12251 98 99 (FliR/MopE/SpaR family) y4yO hrcU +2 516242-517279 3455-339 YscU 354 L25667 30 50 prob. translocation protein inv. insecretion processes 1-340 HrcU 351 L12251 99 99 (FlhB/HrpN/YscU/SpaSfamily) y4yP +1 518077-518892 271 35-262 HipA 295 M19019 88 91 homologis inducible by root-exudate and diadzein; frameshift acc. to homologue:518855 (1>2) fy1 + 519655-519995 NolJ 148 L26967 nodulation genehomologous fragments (80-100% id. in 97 aa); frameshifts acc. tohomologue: 519789 (1>3); 519900 (3>2); 519965 (2>3) y4yQ +2520280-521170 296 hyp. 31.3 kd integral membrane protein y4yR +2521360-523453 697 17-677 LcrD 704 M96850 40 65 prob. translocationprotein inv. secretion processes [Flage11a/HR/Invasion proteins exportpore (FHIPEP) family] y4yS +3 523470-524018 182 hyp. 20.1 kd proteiny4zA +2 525005-525892 295 34-115 Y4rM 350 this work 98 98 hyp.(fragmentous?) 32.9 kd protein; put. frameshift: 133-231 Y4rL 155 thiswork 99 99 525699 (2>3); similar to Y4iE y4zB +1 526051-527121 35660-320 Tnp 377 X67862 29 47 put. (fragmentous?) transposase (IS4 family)526103- 526200 higher cod. prob. in fr. 2; put. frameshift: 526200 (2>1)fz1 + 527337-527902 Hdc 378 J02577 fragments homologous to histidinedecarboxylases (30- 45% id. in 134aa); put. frameshift (3>2) around527478 y4zC +3 529125-529910 261 65-248 AvrPph3 276 M86401 27 41 hyp.28.3 kd protein; hom. to avirulence protein y4zD +3 530145-530294 49hyp. 5.5 kd protein fz4 +2 530432-530764 110 1-110 Y4jA 504 this work 7285 hom. to C-terminus of Y4jA/Y4nE/Y4sE fz2 + 530761-531250 ORFB 251X67861 put. IS-ATF-binding protein fragments (32-40% id. in 137aa); put.frameshift acc. to homolog: 531062 (1>2) y4zF syrM2 +2 532676-533695 3391−320 SyrM 326 M33495 69 81 prob. symbiotic regulator (LysR family)1−335 SyrMl 338 this work 62 79 fz3 + 534257-534422 ORF 338 M73488fragments homologous to 1-aminocydopropane-1- cabboxylate deaminase(63-83% id. in 56aa); put. 
frameshift: 534291
^(a)open reading frame (ORF)
^(b)strand (−/+) or frame (−1; −2; −3; +1; +2; +3)
^(c)number (no.)
^(d)amino acids (aa)
^(e)GenBank/EMBL accession numbers
^(f)identity (I) and similarity (S) have been calculated by the programme BESTFIT (local homology algorithm; Smith and Waterman, 1981) of the WISCONSIN SEQUENCE ANALYSIS PACKAGE (version 8.0, GCG, Madison, USA)
^(g)abbreviations: prob. = probable; cod. prob. = coding probability; acc. = according; inv. = involved; sim. = similar; id. = identical; fr. = frame; acc. no. = accession number; nt = nucleotide; hyp. = hypothetical; put. = putative; hom. = homologous; dep. = dependent; N/C-term = N/C-terminus
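
The identity values in the table above are attributed to BESTFIT, which implements the local homology algorithm of Smith and Waterman. The following is a minimal sketch of that algorithm, not a reproduction of BESTFIT: the match, mismatch and gap scores are illustrative assumptions, a single linear gap penalty is used, and the function names are hypothetical.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for the best local alignment.

    Scores are illustrative assumptions, not BESTFIT's parameters.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    # Fill the dynamic-programming matrix; a cell never drops below zero,
    # which is what makes the alignment local.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(0,
                              score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until a zero cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(aligned_a, aligned_b):
    """Percentage of alignment columns holding the same residue in both rows."""
    if not aligned_a:
        return 0.0
    same = sum(1 for p, q in zip(aligned_a, aligned_b) if p == q and p != "-")
    return 100.0 * same / len(aligned_a)
```

For protein comparisons such as those tabulated, a substitution matrix would replace the simple match/mismatch score, and "similarity" would count columns with a positive matrix score rather than identical residues only.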

In a second stage, the remaining 436 kb of pNGR234a were analyzed. Several ORFs and their deduced proteins were identified that belong to functional groups not previously identified in the analysis of cosmids pXB296, pXB368 and pXB110 (replication of the plasmid, conjugal transfer of the plasmid, functions in oligosaccharide biosynthesis and cleavage, functions in sugar or sugar-derivative metabolism, functions in lipid or lipid-derivative metabolism, functions in chemoperception/chemotaxis, functions in biosynthesis of cofactors, prosthetic groups and carriers, etc.).

Although further functional analyses of selected ORFs in pNGR234a still have to be performed, large-scale sequencing gives a global picture of their genomic organization and possible roles. Determination of putative functions of predicted genes by homology searches and identification of sequence motifs (promoters, nod boxes, nifA activator sequences, and other regulatory elements) will aid in finding new symbiotic genes. High-fidelity sequence data covering long stretches of the genome are a prerequisite for these studies. The use of the dye terminator/thermostable sequenase shotgun approach has allowed the completion of the entire ~500 kb sequence of pNGR234a and has opened up new avenues for the genetic analysis of symbiotic function.

Genetic Organization of the Whole Plasmid pNGR234a

Within the complete nucleotide sequence of pNGR234a, which comprises 536,165 bp, a total of 416 ORFs were predicted to encode proteins. An additional 67 ORF-fragments were detected that seem to be remnants of functional ORFs.
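
As an illustration of how ORFs can be enumerated in the six reading frames used in the tables (+1 to +3 and −1 to −3), the following is a minimal sketch. It is not the prediction procedure used for pNGR234a: it only accepts ATG starts, applies a hypothetical minimum length, and assumes that frame +k begins at forward-strand offset k−1 and frame −k at offset k−1 of the reverse complement, which may differ from the convention of the tables.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_frame(seq, offset, min_codons):
    """Yield 0-based half-open (start, end) spans of ATG..stop ORFs in one frame.

    min_codons counts the codons before the stop codon, ATG included.
    """
    start = None
    for pos in range(offset, len(seq) - 2, 3):
        codon = seq[pos:pos + 3]
        if start is None and codon == "ATG":
            start = pos
        elif start is not None and codon in STOPS:
            if (pos - start) // 3 >= min_codons:
                yield start, pos + 3  # span includes the stop codon
            start = None


def six_frame_orfs(seq, min_codons=3):
    """Return (frame, start, end) tuples in 1-based inclusive forward coordinates."""
    seq = seq.upper()
    n = len(seq)
    found = []
    for offset in range(3):
        for s, e in orfs_in_frame(seq, offset, min_codons):
            found.append((f"+{offset + 1}", s + 1, e))
    rc = reverse_complement(seq)
    for offset in range(3):
        for s, e in orfs_in_frame(rc, offset, min_codons):
            # Map the reverse-strand span back onto forward-strand positions.
            found.append((f"-{offset + 1}", n - e + 1, n - s))
    return found
```

A realistic pipeline would also consider alternative start codons, codon-usage-based coding probability (the "cod. prob." of the table footnotes) and homology evidence before calling an ORF.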

Thirty-four percent (139) of the 416 potential proteins have no obvious similarities to any known proteins. Of the remaining 277 proteins, 31 (8%) are similar to proteins for which no biochemical or phenotypic role has been assigned, 12 (3%) are similar to proteins for which limited biological data is available, and 234 (56%) are similar to proteins with a more precise biological function: enzymes (95), proteins involved in integration and recombination of insertion elements (44), transporters (32), transcriptional regulators (22), protein secretion/export (21), proteins involved in replication and control of the plasmid (12), electron transporters (6), and proteins involved in chemotaxis (2). A high proportion of enzymes was expected of a symbiotic replicon involved in nodulation (Nod-factor biosynthesis, etc.) and nitrogen fixation. As expected from the observation that NGR234 can be cured of its plasmid (Morrison et al., 1983), no ORFs essential to transcription, translation or to primary metabolism were found.
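
The counts quoted in this paragraph can be cross-checked with a short tally. The sketch below uses only the numbers stated above; the variable and category names are illustrative. The computed shares agree with the quoted percentages to within about one percentage point.

```python
TOTAL_ORFS = 416

# Counts as stated in the text.
no_similarity = 139
similar_no_role = 31
similar_limited_data = 12
functional = {
    "enzymes": 95,
    "integration/recombination of insertion elements": 44,
    "transporters": 32,
    "transcriptional regulators": 22,
    "protein secretion/export": 21,
    "plasmid replication and control": 12,
    "electron transporters": 6,
    "chemotaxis": 2,
}


def share(count, total=TOTAL_ORFS):
    """Percentage of the predicted ORFs represented by count."""
    return 100.0 * count / total


assigned = sum(functional.values())

# The functional classes add up to the 234 proteins with a precise function,
# and the four top-level groups add up to all 416 predicted proteins.
assert assigned == 234
assert no_similarity + similar_no_role + similar_limited_data + assigned == TOTAL_ORFS

for name, count in sorted(functional.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {count} ({share(count):.1f}% of all ORFs)")
```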

A large number of protein families are present in several copies on pNGR234a. This is true even after elimination of the many proteins which are encoded in repeated IS elements, or are involved in transposition, integration or recombination. The most notable examples of highly represented protein families include: five members of the short-chain dehydrogenase/reductase family, one of which (y4vI) contains two homologous domains; five complete and one partial ABC-type transporter operons that each encode at least one ABC-type permease and an ABC-type ATP-binding protein; four cytochrome P450s; and three members of peptidase family S9A. In total, 85 proteins belong to families that are represented more than once and which do not seem to be linked to insertion or recombination.

The majority (330, 79%) of the putative proteins are probably located in the cytoplasm of the bacterium, 62 (15%) possibly span membranes, 20 (5%) could be located in the periplasm, 3 are predicted to be lipoproteins that could associate with the outer membrane, and 2 are probably outer membrane proteins. These observations accord well with the dominance of biosynthetic proteins, as well as proteins involved in transcriptional regulation and insertion/recombination, most of which are thought to be cytoplasmic.

Although other start points cannot be excluded, replication of pNGR234a probably begins at oriV, which is located within the intergenic sequence (igs) between the repC and repB-like genes y4cI and y4cJ. This locus (positions 54,417 to 54,570) encodes three proteins with 40-60% amino acid identities to RepABC of pTiB6S3 (a Ti-plasmid of Agrobacterium tumefaciens), pRiA4b (an Ri-plasmid of A. rhizogenes) and pRL8JI (a cryptic plasmid of R. leguminosarum bv. leguminosarum). Amongst replication regions, highest identities (69 to 71% at the nucleotide level) are found in the igs's between repC and repB (FIG. 5). In Agrobacterium, these igs's are the determinants which render parental plasmids incompatible. Two ORFs (position 198,500), which are homologous to pseudomonal genes involved in plasmid stability, may also play a role in replication of pNGR234a. A 12 bp portion of the origin of transfer (oriT) is identical to that of pTiC58 of Agrobacterium tumefaciens (nt 80,162 to 80,173), and highly similar to those of RSF1010 (Escherichia coli) and pTFI (Thiobacillus ferrooxidans). This sequence corresponds to the oriT of plasmids containing the “Q-type nick-region” (FIG. 6).
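
Positions in this description are quoted as 1-based, inclusive coordinates, so that nt 80,162 to 80,173 spans exactly 12 nucleotides. The sketch below shows, under that assumption, how such coordinates map onto Python slices and how exact occurrences of a short motif could be located. The helper names and the toy sequences used with them are hypothetical; the actual oriT sequence is not reproduced here.

```python
def span_length(start, end):
    """Length of a 1-based inclusive interval."""
    return end - start + 1


def extract(seq, start, end):
    """Return the subsequence for 1-based inclusive coordinates."""
    return seq[start - 1:end]


def find_exact(seq, motif):
    """Return 1-based inclusive (start, end) of every exact, possibly overlapping, hit."""
    hits = []
    pos = seq.find(motif)
    while pos != -1:
        hits.append((pos + 1, pos + len(motif)))
        pos = seq.find(motif, pos + 1)
    return hits


# The oriT segment quoted in the text covers 12 positions.
assert span_length(80162, 80173) == 12
```

For example, `find_exact("ACGTACGT", "CGT")` reports the two hits at positions 2-4 and 6-8 of the toy string.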

Another 24 predicted ORFs show homologies to conjugal transfer genes of Agrobacterium Ti-plasmids. All are located in two large clusters between positions 57,000 and 83,000. Since pNGR234a was believed to be non-transmissible (Broughton et al., 1987), the fact that both the nucleotide sequences of the individual ORFs and their order are similar in Agrobacterium and NGR234 came as a surprise. Conjugal transfer of Ti plasmids in A. tumefaciens is controlled by a family of N-acyl-L-homoserine lactone auto-inducers (Zhang et al., 1993). Similar molecules, which are able to interact with the traR gene product of A. tumefaciens, were detected in the supernatants of NGR234 cultures using the assay of Piper et al. (1993).

Reiterated sequences first became apparent in NGR234 during the construction of an ordered array of cosmid clones (Perret et al., 1991). It is now clear that 97 kbp (18%) of pNGR234a represents insertion (IS) and mosaic (MS) sequences (FIG. 7). Homology searches for known IS/MS revealed some of these, while comparison of repeated sequences within pNGR234a, as well as between the plasmid and 2,500 random chromosome sequences (V. Viprey, pers. communication), located the rest. Seventy-five putative ORFs (18% of the total) and 40 fragments of ORFs were identified this way, nearly half of which (44) show homologies to integrases and transposases. Many of these IS elements are similar not only to those derived from Rhizobium and Agrobacterium species, but also to those of other, diverse Gram (−) and Gram (+) bacteria (e.g. Bacillus, Escherichia, and Pseudomonas). The sheer number and diversity of these IS/MS elements suggest that NGR234 has functioned as a “transposon trap”. This is supported by the fact that their average G+C content (61.5%) is 3 percentage points higher than that of pNGR234a (58.5%). Interestingly, many IS/MS are clustered between positions 300,000 and 390,000 (FIG. 7), while some loci are almost unaffected by insertions (oriV, nod-, fix- and nif-ORFs). Small IS/MS clusters divide the replicon into large blocks of often functionally related ORFs (e.g. blocks of nod-ORFs, replication and conjugal transfer ORFs, nif-ORFs and fix-ORFs). A list of all sequences with IS-element or mosaic sequence character is given in Table 4. Although transposition of these IS/MS elements has not been demonstrated, transfer of plasmids amongst rhizobia in the legume rhizosphere (Broughton et al., 1987) and to other non-symbiotic bacteria in fields (Sullivan et al., 1995) suggests that lateral transfer of genetic information has helped shape symbiotic potential.
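
The G+C comparison and the IS/MS coverage quoted above rest on simple sequence statistics. The following minimal sketch shows how such values could be computed; the function names are hypothetical, and the figures passed to them are the ones stated in the text rather than values recomputed from the sequence.

```python
def gc_percent(seq):
    """G+C content as a percentage of the unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return 100.0 * gc / acgt if acgt else 0.0


def coverage_percent(covered_bp, total_bp):
    """Share of a replicon occupied by a class of elements."""
    return 100.0 * covered_bp / total_bp


# Figures quoted in the text: about 97 kbp of IS/MS within 536,165 bp.
is_ms_share = coverage_percent(97_000, 536_165)

# Difference between the quoted average G+C contents, in percentage points.
gc_difference = 61.5 - 58.5
```

Applied per element, `gc_percent` would give the individual values whose average is compared with that of the whole plasmid.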

TABLE 4 Insertion/mosaic sequences in pNGR234a put. ORFs/ ORF- start ofstop of name of fragments similarities homologous sequences in regionregion region included within pNGR234a similarities to chromosome otherorganisms/comments  17000  17600 ISH-10b y4aQ 33% aa-id. to y4hPgeneproducts from IS866 and IS66 from (ISH-10a) Ag. tumefaciens  18900 19661 ISH-11b fa2 54% aa-id. to part of Tnp of IS1202 from Str.pneumoniae y4bF (ISH-11a); 19096-19362: 91% nt-id. to ISH-11c  19666 22981 NGRIS-4a y4bABCD identical to NGRIS-4b many copies on thechromosome  22985  25400 ISH-11a fb1, y4bF y4bF: sim. to fb1 andpartially 91% nt-id. to Tnp of IS1202 from Str. pneumoniae fa2 (ISH-11b)chromosomal sequences  32463  35085 NGRIS-3a y4bLM identical toNGRIS-3b/c copie(s) on the 62% nt-id. (over 2352 nt) to IS1162 ofchromosome Ps. fluorescens (IS21/IS1162/IS408 family)  49300  50300ISH-13a y4cG similar to y4lS (ISH-13b) DNA invertase  69936  70385ISH-4c fd1 70233-70385: 93% nt-id. ORFA of IS5376 from B.stearothermophilus to part of NGRIS-4  93322  96025 ISH-12a fe2, y4eF,93574-94927: 90% Tnp (fe2) and Int (4AeF, fe3) from Weeksella fe5, fe3nt-id. to ISH-12b1; 75% zoohelcum -IS-element; (93322-94586: 57% nt-id.to fq6 region (ISH- nt-id. to IS292 from Ag. radiobacter); “phage”12b2); 95343-95558: Integrase family (Y4eF, fe5, fe3) 88% nt-id. toISH-12b3 101939 102394 ISH-8b fe7 84% nt-id. to ISRm5 of R. meliloti;fe7: mutator family of transposases 115881 116004 MSH-14b partiallyhomologous to 72-73% nt-id. to sequences mosaic element ISH-14adownstream from chvl/up- stream from rpoN on the chromosome 124396124500 MSH-14a partially homologous to 82% nt-id. to sequence mosaicelement ISH-14b RIME1 downstream from chvl on the chromosome; parts ofMSH-14a show 73-89% nt-id. to chromo- somal sequences 126806 127369ISH-12f y4gA low. similarity to y4rE 127900 128500 ISH-12e y4gCrecombinase from pAE1 of Al. eutrophus (“phage” integrase family) 131000131800 ISH-15 y4gE* partially 87% nt-id. 
to chromosomal sequences 159781160564 ISH-16 96% nt-id. to repetitive sequence from R. fredii USDA257(acc. no. M73698) 164600 167700 ISH-10a y4hNOPQ 99% nt-id. of parts ofdifferent ORFs derived from IS-like sequences; y4hPQ to chromosomalpartially known as acc. no. X74068 (“Region2” sequences from pNGR234a);164853-167086: 66% nt-id. to IS66 from Ag. tumefaciens 168208 169190ISH-2c fi1, fi2, fi3 168343-168659: 72% 168208-168383: 70 nt-id. toISRm2011-2 nt-id. to ISH-2f1/ (R. meliloti); fi2/3: IS1111A, IS1328,IS1533 ISH-2d1 family of transposases 165785-169091: 73% nt-id. toISH-2f2/ ISH-2d2 173295 173702 ISH-8g y4iE* y4iE: sim. to y4rL, y4zA,and fr2 175590 175909 ISH-11c y4iG* 175643-175909: 91% nt-id. to ISH-11a185672 186507 ISH-2d y4iO*/P* 185672-186075(−): 73% Y4iO: Tnp of IS1325from Y. enterocolitica (3′-end) nt-id. to ISH-2c2(+) (IS1111A, IS1328,IS1533 family) 186208-186507(−): 72% nt-id. to ISH-2c1(+) 187112 189752NGRIS-5a y4iQjA identical to NGRIS-5b/c copie(s) on the 1stA and B(Tnps) of IS1326 from E. coli chromosome (IS21/IS1162/IS408 family)190000 193500 ISH-10c y4jBCD(E*) 38/32 aa-id. of y4jCD different ORFsderived from IS-like sequences; to y4hOP (ISH10a) partially 60% nt-id.to IS866 (Ag. tumefaciens); IS292 (Ag. radiobacter); ISR11 (R.leguminosarum) 193518 193634 MSH-17 76% nt-id. to repetitive sequenceRMX6 from Myxococcus xanthus (acc. no. M60865) 199746 199958 ISH-11dy4jM* similarity to fb1 and y4bF (ISH-11a) 211165 211265 ISH-10g 74%nt-id. to ISR11 (R. leguminosarum), IS66/IS866 derivative 211350 212580ISH-10h fk2 similar to y4jD 74% nt-id. to IS66 (ISH-10c) 217564 220186NGRIS-3b y4kIJ identical to NGRIS-3a/c copie(s) on the 62% nt-id. (over2352 nt) to IS1162 of chromosome Ps. fluorescens (IS21/IS1162/IS408family) 224547 224995 ISH-18a y4kQ 83% nt-id. to ISH18b IS110 family(3′-end) (427651-428102) 240800 241040 ISH-24b fl2 60% nt-id. to ISR12from R. leguminosarum 244540 244851 ISH-19a fl4 244620-244812: 97% TnpAfrom Tn163 (R. leguminosarum) nt-id. 
to ISH-19b 248290 248655 ISH-19bfl5 248463-248655: 97% nt-id. to ISH-19a 248814 249680 ISH-20 fl6 Tnp ofTn1546 (Enterococcus faecium; Tn21/501/1721 family) 251407 252400ISH-13b y4lSmA y4lS: similar to y4cG y4lS: invertase; 58% nt-id.(251409-252211) to (ISH-13a) Tn501 from Ps. aeruginosa (acc. no. Z00027)258551 258657 MSH-21 mosaic sequence: 82% nt-id. to sequence up- streamof ropA2 (R. leguminosarum; acc. no. X80794) 280403 283043 NGRIS-5by4nDE identical to NGRIS-5b/c copie(s) on the 1stA and B (Tnps) ofIS1326 from E. coli chromosome (IS21/IS1162/IS408 family) 284722 284985ISH-1b fn2 60% nt-id. to IS1162 (Ps. fluorescens, IS21/IS1162/IS408family) 300017 300819 ISH-1c fo3 61% nt-id. IS408 (Ps. cepacia;IS21/IS1162/IS408 family) 300820 304117 NGRIS-6 fo4/5/6, 77% nt-id. toNGRIS-4 y4oL/M/N 304118 304434 ISH-1d fo7 61% nt-id. IS408 (Ps. cepacia;IS21/IS1162/IS408 family) 318854 319686 NGRIS-7 fp1-2 66% nt-id. toIS1248 of Pa. denitirificans 320456 328935 NGRRS-1a fp3/4; y4pL 3 copieson the interrupted by NGRIS2a and 4b; fp3/4, Y4pL: chromosomeIS21/IS1162/IS408 family 320590 323147 NGRIS-2a y4pEFG identical toNGRIS-2b partially 88-90% nt-id. to repetitive sequence RDRS9 of R.fredii USDA257 (IS1111A/ IS1328/IS1533 family) 323961 327276 NGRIS-4by4pHIJK identical to NGRIS-4a many copies on the chromosome (disruptsall 4 copies of NGRRS-1) 335004 336301 NGRIS-8 y4pO similar to fe7(ISH-8b) 88% nt-id. to ISRm3 of R. meliloti: mutator family oftransposases 342272 342419 ISH-12d 342272-342419: 87% nt-id. to ISH-12b4344100 345300 ISH-2e y4qE Tnp (Leptospira borgpetersenii):IS1111A/IS1328/IS1533 family 345755 346133 ISH-12c fq4 345755-346133:82% Int (XerC, E. coli): “phage” integrase family nt-id. to ISH-12b5351600 351735 MSH-22 80 nt-id. to sequence from pTiS4 (Ag. vitis; acc.no. M91609) 351826 353794 ISH-10d y4qI, fq5 fq5: 35% aa-id. to y4hQ71-95% nt-id. of parts of 67% nt-id. to ISR11 (R. leguminosarum; acc.(ISH-10a) y4qI to chromosomal no. 
L19650); IS866/66 homolog sequences354000 363073 ISH-12b y4qJK, fq6, 354942-35612/356215- 70% nt-id. ofparts of Tnp and Int from Weeksella zoohelcum -IS-ele- y4rABCDEF 356383:90/91% nt-id. to ISH14a1 to chromosomal ment (y4qJK), differentintergrases (y4rAB), ISH12a1; 75% nt-id. to sequences integrase XerC ofH. influenzae (y4rC); fe5 region (ISH-12a2); y4qK, fq6, y4rABCDEF:“phage” integrase 359753-359968: 88% family nt-id. to ISH-12a3;361029-361410: 82% nt-id. to ISH-12c 362507-362654: 87% nt-id. toISH-12d 363287 363694 ISH-10i y4rG low similarity to y4jB unknownprotein from IS1312 (Ag. and fr4 (ISH-10c/i) tumefaciens)/IS866 366252367402 ISH-8f fr1, fr2 366252-366524: 88% nt-id. to ISH-8e366773-366953: 92% nt-id. to ISH-8g 367699 367970 ISH-10e fr3 56% aa-id.of fr3 to 75% nt-id. to IS66 (Ag. tumefaciens) y4hO (ISH-10a) 368503369675 ISH-23 y4rI 91-93% nt-id. of parts of y4rI to chromosomalsequences 369697 370887 ISH-2f y4rJ 370012-370328: 72% y4rJ: Tnp fromIS1111a of Coxiella burnetii nt-id. to ISH-2c1 (IS1111A/IS1328/IS1533family) 370479-370785: 73% nt-id. to ISH-2c2 371399 372990 ISH-8ey4rL*M* 371399-371671: 88% nt-id. to ISH-8f 371474-372228: 97% nt-id. toISH-8d 377185 377695 ISH-10j fr4 similar to y4rG (ISH-10i)377327-377695: 75% nt-id. to ISRm6 (R. meliloti) 377826 380383 NGRIS-2by4sABC identical to NGRIS-2a partially 88-90% nt-id. to repetitivesequence RFRS9 of R. fredii USDA257 (IS1111A/ IS1328/IS1533 family)380883 383523 NGRIS-5c y4sDE identical to NGRIS-5a/b copie(s) on thechromo- 1stA and B Tnps) of IS1326 from E. coli some (IS21/IS1162/IS408family) 383593 384054 ISH-2g fs5 Tnp of IS1328 of Y. enterocolitica(IS1111A/IS1328/IS1533 family) 384210 384493 ISH-10k fragments with94-84% nt-id. to ISRm6 (R. meliloti) 388100 388600 ISH-2h fs1 differentTnps (IS1111A/IS1328/IS1533 family) 388601 388900 ISH-10l fs2 ORF fromIS1312 of Ag. tumefaciens (IS66/866 family) 396445 397301 NGRIS-9 y4sNand 91-99% nt-id. 
of NGRIS9- different ORFs derived from IS elements;fs4 parts to chromosomal partially known from acc. no. X74314 sequences400626 403248 NGRIS-3c y4tAB identical to NGRIS-3a/b copie(s) on thechromo- 62% nt-id. (over 2352 nt) to IS1162 of some Ps. fluorescens(IS21/IS1162/IS408 family) 426525 428102 ISH-18b y4uE* 427651-428102:83% 77-96% nt-id. of ISH-18b- Tnp of mini-circle DNA from Str.coelicolor nt-id. to ISH-18a parts to chromosomal (IS110 family)sequences 429860 430007 ISH-8c fu3 85% nt-id. to ISRm5 (R. meliloti)430568 432851 ISH-1e y4uHI 60% nt-id. to IS408/IS1162 (Ps. cepacia/Ps.fluorescens) 433222 433560 ISH-24a fu4 low similarity to y4sN 79% nt-id.to ISRm4 (R. meliloti)/ISR12-like (NGRIS-9) 462554 463053 ISH-10ffragments with 83-69% nt-id. to IS866 (Ag. tumefaciens) 524946 525892ISH-8d y4zA 525095-525849: 97% 524946-525580: 61% nt-id. to ISRm5 (R.nt-id. to ISH-8e meliloti) 526051 527121 ISH-25 y4zB* Tnp of IS5376 fromB. stearothermophilus (IS4 family of transposases) 530364 531249 ISH-1ffz4, fz2 79% nt-id. to part of fz4/2: IS21/IS1162/IS408 family NGRIS-5Abbreviations: Tnp = transposase; Int = integrase; nt-id. =nucleotide-identity; aa-id. = aminoacid identity IS elements withprecisely defined borders are designated as NGRRS/NGRIS-1 to 9. Othersequences which show homologies to known mosaic or IS-like sequences(mosaic/insertion sequence homologs) are named MSH and ISH,respectively.

Carbohydrates are constituents of the rhizobial cell wall as well as morphogens called Nod-factors (short tri- to penta-mers of N-acetyl-D-glucosamine, substituted at the non-reducing terminus with C16 to C18 saturated or partially unsaturated fatty acids). Elements of the biosynthetic pathways leading to cell walls or to lipo-chito-oligosaccharides (Nod-factors) are common. Most differences are found in the later stages of the pathways that lead to specific cell-wall components or to Nod-factors.

As befits a symbiotic replicon, only 13 ORFs with homology to polysaccharide synthesis genes (house-keeping genes sensu stricto) are located on the plasmid (Table 3). Sequences homologous to exoB, exoF, exoK, exoL, exoP, exoU, and exoX (X. Perret and V. Viprey, unpublished), and exoY (Gray et al., 1990) are clearly located on the chromosome. Although loci with weak homologies to nod-box::psiB of R. leguminosarum and to exoX of R. meliloti exist on the plasmid (y4iR and y4xQ, respectively), these are regulatory rather than structural genes, suggesting that almost all cell wall polysaccharide synthesis ORFs are chromosomally located.

Except for nodPQ and nodE, at least one copy of all the regulatory and structural ORFs involved in Nod-factor biosynthesis seems to be located on the plasmid. The activity of most nodulation genes is modulated by four transcriptional regulators of the lysR family. These are nodD1 (y4aL), syrM1 (y4pN), nodD2 (y4xH), and syrM2 (y4zF). NodC, which is an N-acetylglucosaminyltransferase and the first committed enzyme in the Nod-factor biosynthetic pathway, is encoded within an operon which includes nodABCIJnolOnoeIE (y4hI to y4hB, Table 3). Together, these genes, which form the hsnIII locus, are responsible for the synthesis of the core Nod-factor molecule, and the adjunction of 3- (or 4)-O-carbamoyl, 2-O-methyl, and 4-O-sulfate groups (Hanin et al., unpublished). nodZ (y4aH), which encodes a fucosyltransferase, is part of the hsnI locus, which includes noeJ (y4aJ), noeK (y4aI), noeL (y4aG), nolK (y4aF), all of which are involved in the fucosylation of NodNGR factors (Fellay et al., 1995a). Wild-type NodNGR factors are also N-methylated and 6-O-carbamoylated, adjuncts which are added by the transferases encoded by nodS and nodU respectively [y4nC and y4nB; hsnII (Lewin et al., 1990)]. Possibly the only other enzyme which may be directly involved in Nod-factor biosynthesis is that encoded by nolL (y4eH, Table 3). As the 2-O-methylfucose residue of NGR234 Nod-factors is either 3-O-acetylated, or 4-O-sulphated, an acetyltransferase is obviously required. Since NolL shows only limited homology to acetyltransferases, experimental proof of the transferase activity will, however, be required.

In contrast to R. leguminosarum and R. meliloti harbouring pNGR234a, A. tumefaciens(pNGR234a) transconjugants are incapable of nitrogen fixation (Broughton et al., 1984), suggesting that some essential fix ORFs are also carried by the chromosome. Nevertheless, more than 40 nif- and fix-ORFs are plasmid borne. Included amongst these is nifA (y4uN), which encodes a sigma-54 dependent regulator. Mutation of rpoN (which encodes sigma 54) causes a Fix⁻ phenotype on NGR234 hosts (van Slooten et al., 1990). Similarly, mutation of fixF (y4gN) disrupts synthesis of a rhamnose-rich extra-cellular polysaccharide, and results in a Fix⁻ phenotype on Vigna unguiculata, the reference host for NGR234 (unpublished). In fact, loci adjacent to fixF are probably responsible for the synthesis of dTDP-rhamnose from glucose-1-phosphate. Enzymes involved in this biosynthetic pathway include glucose-1-phosphate thymidylyltransferase (y4gH), dTDP-glucose-4,6-dehydratase (y4gF), dTDP-4-dehydrorhamnose-3,5-epimerase (y4gL), and dTDP-4-dehydrorhamnose reductase (y4gG). Rhamnose-rich lipopolysaccharides (LPS) seem to be necessary for complete bacteroid development and nitrogen fixation (Krishnan et al., 1995). Perhaps the enzyme encoded by y4gI is needed for the synthesis of the rhamnose-rich LPSs from dTDP-rhamnose.

Although not directly involved in the fixation process, mutation of the plasmid borne copy of dctA (=dctA1, y4vF) also impairs nitrogen fixation (van Slooten et al., 1992). Other nif- and fix-ORFs are involved in elaboration of the electron-transfer complex (fixAB), in various cofactors required for nitrogen fixation (e.g. fixC, nifB, nifE, nifN, etc.), and in the synthesis of ferredoxins (fdxB, fdxN, fixX). Finally, those ORFs involved in the synthesis of the nitrogenase complex are also present. Amongst these are two functional copies of the nifKDH ORFs (y4vM to y4vK and y4xC to y4xA) (Badenoch-Jones et al., 1989). Additionally, 17 new ORFs located within the nitrogen fixation cluster (see FIG. 7; ORFs y4vC to y4vJ with the exception of dctA1, y4wA to y4wG, y4wI, y4wJ and y4xQ) are co-transcribed together with the ORFs homologous to known nif and fix genes. It thus seems likely that most ORFs necessary for bacteroid development and synthesis of the nitrogen-fixing complex are carried by pNGR234a.

Two types of regulatory elements which frequently occur in pNGR234a are the NodD- and NifA/sigma-54-dependent promoters. NodD-dependent promoter-like sequences known as nod boxes have been identified by homology search within intergenic regions, using the following consensus sequence: 5′-YATCCAYNNYRYRGATGNNNNYNATCNAAACAATCRATTTTACCAATCY-3′ [12 mismatches allowed (van Rhijn and Vanderleyden, 1993); Y=C or T, R=A or G, N=A, C, G or T]. Putative NifA-dependent promoters (Fischer, 1994) have been predicted by screening for the NifA activator sequence (5′-TGT-N₁₀-ACA-3′) together with the sigma-54 promoter consensus sequence (5′-TGGCAC-N₅-TTGCA/T-3′ with GG and GC as the most conserved doublets; 3 mismatches allowed) separated by 60 to 150 nucleotides. The identified conserved promoter-like sequences in pNGR234a are listed in Tables 5 and 6.
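The consensus screening described above can be sketched as a sliding-window comparison against a degenerate pattern. This is a minimal illustration only, not the search program actually used; the function names (count_mismatches, find_sites, reverse_complement) and the brute-force scan are assumptions introduced here. Only the consensus strings and the Y/R/N code meanings are taken from the text.

```python
# Degenerate codes as defined in the text: Y = C or T, R = A or G, N = any base.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "Y": "CT", "R": "AG", "N": "ACGT"}

# nod box consensus quoted in the text (49 nt).
NOD_BOX = "YATCCAYNNYRYRGATGNNNNYNATCNAAACAATCRATTTTACCAATCY"
# NifA upstream activator sequence 5'-TGT-N10-ACA-3' (16 nt).
NIFA_UAS = "TGT" + "N" * 10 + "ACA"


def count_mismatches(window, consensus):
    """Count positions where the window violates the degenerate consensus."""
    return sum(base not in IUPAC[code] for base, code in zip(window, consensus))


def find_sites(seq, consensus, max_mismatches):
    """Yield (start, end, mismatches), 1-based inclusive, for every window
    of the given strand that stays within the mismatch limit."""
    seq = seq.upper()
    n = len(consensus)
    for i in range(len(seq) - n + 1):
        mm = count_mismatches(seq[i:i + n], consensus)
        if mm <= max_mismatches:
            yield (i + 1, i + n, mm)


def reverse_complement(seq):
    """Reverse complement, so that the minus strand can be scanned as well."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]
```

Under these assumptions, list(find_sites(sequence, NOD_BOX, 12)) would return plus-strand candidates at the 12-mismatch threshold quoted in the text, and the same call on reverse_complement(sequence) would cover the minus strand (coordinates then refer to the reverse-complemented sequence). Restricting hits to intergenic regions, as the text describes, is left out of the sketch.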

TABLE 5 nod box-like sequences in pNGR234a

Number of    Position in       Orientation   Mismatches to the     Distance to the   Name of the
nod box      pNGR234a                        consensus sequence    following ORF     following ORF
 1           4514-4562         −             11                     504              (fal)
 2           8481-8529         −              8                      87              nodZ
 3           12322-12370       −              7                       -              ?#
 4           97470-97518       −              6                     277              nolL
 5           129615-129663     +             10                    1358              y4gE
 6           141088-141136     +              8                     890              fixF
 7           150280-150327     −             11                     202              noeE
 8           158820-158868     −              4                     235              nodA
 9           161891-161939     +             11                    1103              y4hM
10           169833-169881     −              7                     117              y4iR
11           278947-278995     −              7                     153              nodS
12           279821-279869     +              7                       -              ?#
13           443101-443149     −             10                     465              y4vC
14           473059-473107     +              9                     236              y4wH
15°          481253-481301     −             16                     117              y4wM
16           493961-494009     +              6                     288              y4xI
17           532039-532087     +              5                     589              syrM2
18           256434-256482     +             12                     329              y4mC
19           469151-469199     +             12                     112              y4wE

°The majority of the mismatches is located in the 3′-terminal part of the sequence.
#No predicted ORF can be found downstream of the putative nod box.

TABLE 6 Putative NifA-dependent promoters in pNGR234a

Nr.   NifA-dep. UAS*:    sigma-54 promoter      Orientation   Distance to the       Name of the
      position           (−12/−24 region#):                   following ORF (nt)    following ORF
                         position
 1    90812-90827        90910-90924            +             127                   y4eD
 2    162727-162742      162788-162802          +             240                   y4hM
 3    235036-235051      234934-234948          −              66                   y4lD
 4    255021-255036      255130-255144          +             306                   y4mB
 5    285265-285280      285343-285357          +              50                   y4nG
 6    436363-436378      436275-436289          −              41                   nifB
 7    442046-442061      441955-441969          −              56                   fixA
 8    442735-442750      442676-442690          −              40                   y4vC
 9    444109-444124      443983-443997          −             104                   y4vD
10    444137-444152      444241-444299°         +              38°                  nifQ
11    451782-451799      451891-451905          +              88                   nifH1
12    460319-460334      460424-460438          +              63                   y4vR
13    463063-463078      463139-463153          +              48                   y4wA
14    478839-478854      478761-478775          −             463                   nifS
15°   483663-483678      483769-483783          +              88                   nifH2

*“Upstream Activator Sequence”: NifA-binding site located 80 to 150 nt upstream of the transcription start point (5′-TGT-N₁₀-ACA-3′).
#Sequence corresponding to the consensus sequence of conserved sigma-54-promoters 12 nt upstream of the transcription start point: 5′-TGGCAC-N₅-TTGC-3′ (2 mismatches allowed).
°3 possibilities for a promoter (in two cases only corresponding to the minimal consensus: 5′-GG-N₁₀-GC-3′).

EXAMPLES

Example 1

General Methods

Bacteria and Plasmids

Escherichia coli was grown on SOC, in TB or in two-fold YT medium (Sambrook et al., 1989). The cosmid clones pXB296 and pXB110 (Perret et al., 1991) were raised in E. coli strain 1046 (Cami and Kourilsky, 1978). Subclones in M13mp18 vectors (Yanisch-Perron et al., 1985) were grown in E. coli strain DH5αF′IQ (Hanahan, 1983).

Construction of Cosmid Libraries

Cosmid DNA was prepared by standard alkaline lysis procedures followed by purification in CsCl gradients (Radloff et al., 1967). DNA fragments sheared by sonication of 10 μg of cosmid DNA were treated for 10 min at 30° C. with 30 units of mung bean nuclease (New England Biolabs, Beverly, Mass., USA), extracted with phenol/chloroform (1:1), and precipitated with ethanol. DNA fragments, ranging in size from 1 to 1.4 kbp, were purified from agarose gels using Geneclean II (Bio101, Vista, Calif., USA) and ligated into SmaI-digested M13mp18. Electroporation of aliquots of the ligation reaction into competent E. coli DH5αF′IQ was performed according to standard protocols (Dower et al., 1988; Sambrook et al., 1989).

M13 Template Preparation

Fresh 1 ml E. coli cultures in twofold YT held in 96-deep-well microtiter plates (Beckman Instruments, Fullerton, Calif., USA) were infected with recombinant phages from white plaques grown on plates containing X-gal (5-bromo-4-chloro-3-indolyl-β-D-galactoside) and IPTG (isopropyl-β-D-thiogalactopyranoside). Rapid preparation of ~0.5 μg of single-stranded M13 template DNA was carried out as follows: 190 μl portions of the phage cultures grown for 6 hr at 37° C. were transferred into 96-well microtiter plates. Lysis of the phages was obtained by adding 10 μl of 15% (w/v) SDS followed by 5 min incubation at 80° C. Template DNA was trapped using 10 μl (1 mg) of paramagnetic beads (Streptavidin MagneSphere Paramagnetic Particles Plus M13 Oligo, Promega, Madison, Wis., USA) and 50 μl of hybridization solution [2.5 M NaCl, 20% (w/v) polyethylene glycol (PEG-8000)] during an annealing step of 20 min at 45° C. Beads were pelleted by placing microtiter plates on appropriate magnets and washed three times with 100 μl of 0.1-fold SSC. The DNA was recovered in 20 μl of water by a denaturation step of 3 min at 80° C. When required, larger amounts of single-stranded recombinant DNA (>10 μg) were purified using QIAprep 8 M13 Purification Kits (Qiagen, Hilden, Germany) from 3 ml of supernatant of phage cultures grown for 6 hr at 37° C.

Sequencing

Two sequencing methods were used: dye terminator and dye primer cycle sequencing, each in combination with either AmpliTaq DNA polymerase (Perkin-Elmer) or Thermo Sequenase (Amersham). All reactions, including ethanol precipitation, were performed in microtiter plates. Reagents were pipetted using 12-channel pipettes. Where necessary, sequencing reaction mixtures, including enzymes, were pipetted into the plates in advance and held at −20° C. until needed.

Dye Terminator Cycle Sequencing

For dye terminator/AmpliTaq DNA polymerase sequencing, 0.5 μg of template DNA and the PRISM Ready Reaction DyeDeoxy Terminator Cycle Sequencing Kit (Perkin-Elmer) were used. Cycle sequencing was performed in microtiter plates using 25 PCR cycles (30 sec at 95° C., 30 sec at 50° C., and 4 min at 60° C.). Prior to loading the amplified products on electrophoresis gels, unreacted dye terminators were removed using Sephadex columns scaled down to microtiter plates (Rosenthal and Charnock-Jones, 1993).

Dye terminator/Thermo Sequenase sequencing was performed using the same experimental conditions except that the reaction mix contained 16.25 mM Tris-HCl (pH 9.5), 4.0 mM MgCl₂, 0.02% (v/v) NP-40, 0.02% (v/v) Tween 20, 42 μM 2-mercaptoethanol, 100 μM dATP/dCTP/dTTP, 300 μM dITP, 0.017 μM A/0.137 μM C/0.009 μM G/0.183 μM T from Taq Dye Terminators (Perkin-Elmer; no. A5F034), 0.67 μM primer, 0.2-0.5 μg of template DNA, and 10 units of Thermo Sequenase (Amersham) in a 30 μl reaction volume. Unincorporated dye terminators were removed from reaction mixtures by precipitation with ethanol.
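For bench use, the final concentrations quoted above can be converted into absolute amounts per 30 μl reaction, since 1 μM corresponds to 1 pmol per μl. The following is a small worked sketch; the dictionary keys and the helper name picomoles are illustrative, and it assumes that the 100 μM figure applies to each of dATP, dCTP and dTTP individually.

```python
REACTION_VOLUME_UL = 30.0  # reaction volume stated in the text

# Final concentrations in micromolar, as quoted for the Thermo Sequenase mix.
# Assumption: 100 uM applies to each of dATP, dCTP and dTTP separately.
CONCENTRATIONS_UM = {
    "primer": 0.67,
    "dATP": 100.0,
    "dCTP": 100.0,
    "dTTP": 100.0,
    "dITP": 300.0,
    "dye terminator A": 0.017,
    "dye terminator C": 0.137,
    "dye terminator G": 0.009,
    "dye terminator T": 0.183,
}


def picomoles(conc_um, volume_ul=REACTION_VOLUME_UL):
    """Amount in pmol: micromolar times microlitres (1 uM = 1 pmol/ul)."""
    return conc_um * volume_ul


# Amount of each component present in one reaction, in pmol.
amounts_pmol = {name: picomoles(c) for name, c in CONCENTRATIONS_UM.items()}

for name, pmol in amounts_pmol.items():
    print(f"{name}: {pmol:.2f} pmol per reaction")
```

On these figures a single reaction contains about 20 pmol of primer but well under 10 pmol of any one dye terminator.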

Dye Primer Cycle Sequencing

Dye primer/AmpliTaq DNA polymerase sequencing reactions were performed according to the instructions accompanying the Taq Dye Primer, 21M13 Kit (Perkin-Elmer). Cycle sequencing was carried out on 0.5 μg of template DNA with 19 PCR cycles (30 sec at 95° C., 30 sec at 50° C., and 90 sec at 72° C.) followed by six cycles, each consisting of 95° C. for 30 sec and 72° C. for 2.5 min. Prior to electrophoresis, the four base-specific reactions were pooled and precipitated with ethanol.

Identical PCR conditions and the Thermo Sequenase Fluorescent Labelled Primer Cycle Sequencing Kit (Amersham) were used for dye primer/Thermo Sequenase sequencing reactions.

Sequence Acquisition and Analysis

Gel electrophoresis and automatic data collection were performed with ABI 373A DNA sequencers (Perkin-Elmer). After removing cosmid vector and M13mp18 sequences from the shotgun sequence data, the data were assembled using the program XGAP (Dear and Staden, 1991) and edited against the fluorescent traces. To close remaining gaps, to make single-stranded regions double-stranded, and to clarify ambiguities, additional cycle sequencing reactions with selected shotgun templates were carried out using either custom-made primers (primer-walks) or the universal primer.

The complete double-stranded DNA sequence of cosmid pXB296 was analyzed using programs from the Wisconsin Sequence Analysis Package (version 8, Genetics Computer Group, Madison, Wis., USA). Homology searches were performed with BLAST (version 1.4; Altschul et al., 1990) and FASTA (version 2.0; Pearson and Lipman, 1988). Several nucleotide and protein databases were screened (GenBank/GenPept, SwissProt, EMBL, and PIR). Identities and similarities between homologous amino acid sequences were calculated with the alignment program BESTFIT (Smith and Waterman, 1981).

Example 2

Comparison of Fluorescent Traces Created by Different Cycle Sequencing Methods

When using a thermostable sequenase [Thermo Sequenase (Amersham)], the concentrations of dye terminators (Perkin-Elmer) can be reduced by 20- to 250-fold in comparison to the concentrations needed for Taq DNA polymerase without compromising the quality of the sequencing results (Table 7).
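The fold reductions quoted above can be checked directly from the concentrations given in Table 7. The short calculation below is a worked illustration only; the variable and function names are hypothetical, and the concentrations are those listed in Table 7.

```python
# Dye terminator concentrations in micromolar, taken from Table 7.
AMPLITAQ_UM = {"A": 0.751, "C": 22.500, "G": 0.200, "T": 45.000}
THERMO_SEQUENASE_UM = {"A": 0.017, "C": 0.137, "G": 0.009, "T": 0.183}


def dilution_factors(taq, thermo):
    """Fold reduction for each dye terminator: AmpliTaq / Thermo Sequenase."""
    return {base: taq[base] / thermo[base] for base in taq}


factors = dilution_factors(AMPLITAQ_UM, THERMO_SEQUENASE_UM)
for base, fold in factors.items():
    print(f"{base}: {fold:.1f}-fold")
```

The computed ratios (roughly 44, 164, 22 and 246 for A, C, G and T) agree with the rounded dilution factors of 40, 160, 20 and 250 given in Table 7 and with the 20- to 250-fold range stated in the text.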

TABLE 7 Concentrations (in μM) of dye terminators in each cycle sequencing reaction with two different thermostable DNA polymerases

Dye          AmpliTaq DNA    Thermo Sequenase    Dilution factor for
terminator   polymerase      DNA polymerase      dye terminators^(a)
A Taq         0.751          0.017                40
C Taq        22.500          0.137               160
G Taq         0.200          0.009                20
T Taq        45.000          0.183               250

^(a)Thermo Sequenase vs. AmpliTaq.

To compare the dye terminator and dye primer cycle sequencing procedures, representative templates derived from the pXB296 library were sequenced by both methods, each performed with Thermo Sequenase and Taq DNA polymerase (FIG. 1). In general, dye terminator traces do not contain the many compressions (on average, one compression every 50 bases in single reads) that are common with dye primers if mixes do not contain nucleotide analogues like deoxyinosine or 7-deaza-deoxyguanosine triphosphates or if sequencers are used without active heating systems. In addition, dye terminator traces obtained with Thermo Sequenase show more uniform signal intensities than those obtained with Taq DNA polymerase, thus resulting in a reduced number of weak and missing peaks (e.g. a weak G-signal following an A-signal in Thermo Sequenase traces or a weak C-signal following a G-signal in Taq DNA polymerase traces). Using ABI 373A sequencers, errors in automatic base-calling of Thermo Sequenase/dye terminator scans only arise after 300-350 bases. The average number of resolved bases in dye primer gels (378 bases) is 46 bases greater than in those produced with dye terminators (332 bases). Furthermore, in Thermo Sequenase/dye primer sequences the peaks are very regular and the number of stops and missing bases decreases in comparison to Taq DNA polymerase/dye primer electropherograms. The number of compressions, however, is not significantly reduced.

Example 3

Shotgun Sequencing of Entire Cosmids Using Dye Terminators or Dye Primers

To compare the efficiency of both methods, cosmid pXB296 of pNGR234a was shotgun sequenced using a combination of dye terminators and thermostable sequenase (Thermo Sequenase), whereas another cosmid, pXB110, was sequenced using a combination of dye primers and Taq DNA polymerase (Table 1). Over 93% (736 clones) of 786 dye terminator reads of pXB296 were accepted by XGAP with a maximal alignment mismatch of 4%. By increasing this level to 25%, so that most of the remaining data could be included in the assembly, 775 reads led to three 6 to 10 kbp stretches of contiguous sequence (contigs), two of which were joined after editing. To close the last gap and to complete single-stranded regions with data derived from the opposite strand, only 32 additional dye terminator reads using custom-made primers were required. It took <1 week to assemble and finalize the 34,010 bp DNA sequence of pXB296 (EMBL accession no. Z68203; eight-fold redundancy; GC content, 58.5 mol %).

In contrast, only 308 (34%) of 899 shotgun reads obtained by Taq DNA polymerase/dye primer cycle sequencing of pXB110 were included in the first assembly (4% alignment mismatch). At the 25% alignment mismatch level, 879 reads were assembled, leading to 25 short contigs (1-2 kbp). These contigs had to be edited extensively in order to join most of them. “Primer walks”, covering gaps and complementing single-stranded regions, were not sufficient to clarify all the remaining ambiguities in the assembled sequence. Every 100-150 bp, a compression in one strand could not be resolved by sequence data from the complementary strand. Therefore, it was necessary to resequence clones using dye terminators and universal primer. In total, 191 additional dye terminator reads had to be created. As a result, assembling and finalizing the 34,573 bp sequence of pXB110 (10.5-fold redundancy; GC content, 58.3 mol %) took much more time than pXB296 did.
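The assembly statistics for the two cosmids can be compared side by side with a few lines of arithmetic. The sketch below uses only the read counts, sequence lengths and redundancies quoted in the text; the class and field names are hypothetical. The implied mean read length rests on the assumption that every shotgun and additional read contributes to the stated redundancy, which the text does not confirm.

```python
from dataclasses import dataclass


@dataclass
class CosmidRun:
    name: str
    shotgun_reads: int     # shotgun reads produced
    accepted_at_4pct: int  # reads accepted at 4% alignment mismatch
    extra_reads: int       # additional finishing reads
    length_bp: int         # final sequence length
    redundancy: float      # stated fold redundancy

    def acceptance_pct(self):
        """Share of shotgun reads accepted in the first (4%) assembly."""
        return 100.0 * self.accepted_at_4pct / self.shotgun_reads

    def implied_mean_read_length(self):
        """Back-of-envelope estimate: redundancy x length / total reads.
        Assumes all shotgun and additional reads count toward redundancy."""
        total_reads = self.shotgun_reads + self.extra_reads
        return self.redundancy * self.length_bp / total_reads


pxb296 = CosmidRun("pXB296", 786, 736, 32, 34010, 8.0)
pxb110 = CosmidRun("pXB110", 899, 308, 191, 34573, 10.5)

for run in (pxb296, pxb110):
    print(run.name,
          f"{run.acceptance_pct():.1f}% accepted,",
          f"~{run.implied_mean_read_length():.0f} bases per read")
```

Under that assumption both runs imply a mean usable read length of roughly 333 bases, so the difference between the two cosmids lies in how many reads assembled cleanly (93.6% against 34.3%) rather than in the amount of raw data.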

Example 4

Analysis of Cosmid pXB296

Putative ORFs were located on the 34,010 bp sequence of pXB296 using the programs TESTCODE (Fickett, 1982) and CODONPREFERENCE (Gribskov et al., 1984), the latter in combination with a codon frequency table based on previously sequenced genes of Rhizobium sp. NGR234 (as well as the closely related R. fredii). All 28 ORFs and their deduced amino acid sequences exhibited significant homologies to known genes and/or proteins. The positions of the ORFs along pXB296, as well as the best homologues, are displayed in Table 2 and FIG. 2. Ribosomal binding site-like sequences (Shine and Dalgarno, 1974) precede each putative ORF except for ORF9 (position 11,214-12,455). If one disregards the homology to known glutamate dehydrogenases in the first 32 amino acids deduced from this ORF, a downstream alternative start codon (position 11,220) preceded by a Shine-Dalgarno sequence can be identified. Most of the ORFs are organised in five clusters (ORFs with only short intergenic spaces or overlaps between them). Cluster I, containing ORF1 to ORF5, encodes proteins homologous to trans-membrane and membrane-associated oligopeptide permease proteins and to a Bacillus anthracis encapsulation protein. Cluster II includes ORF6 and ORF7, which are homologous to aminotransferase and (semi)aldehyde dehydrogenase genes. Homologies to transposase genes [ORF8; cluster III (ORF10 and ORF11)] and to various nif and fix genes [cluster IV (ORF12 to ORF20); ORF23, part of cluster V] are also reported.
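As an illustration only, the sketch below shows the simplest form of ORF location: scanning each forward reading frame for an ATG followed in frame by a stop codon. It is not TESTCODE or CODONPREFERENCE, which score coding potential statistically, and it does not check for Shine-Dalgarno sequences. The function name, the ATG-only start rule and the minimum-length cut-off are assumptions introduced for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=30):
    """Return (start, end) pairs, 1-based inclusive, for ATG...stop ORFs in
    the three forward frames of seq. The end coordinate includes the stop
    codon. min_codons counts the ATG plus all sense codons before the stop.
    Only the first ATG after the previous stop is used as the start."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open a candidate ORF at the first ATG
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start + 1, i + 3))
                start = None  # close the ORF and look for the next ATG
    return sorted(orfs)
```

To cover the opposite strand, the same function can be applied to the reverse complement of the sequence. Alternative start codons, such as the downstream start discussed for ORF9, would require extending the start rule.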

Presumed promoter and stem-loop sequences that might represent ρ-independent terminator-like structures (Platt, 1986) are shown in FIG. 2. Significant σ⁵⁴-dependent promoter consensus sequences (5′-TGGCACG-N₄-TTGC-3′; Morett and Buck, 1989), as well as nifA upstream activator sequences (5′-TGT-N₁₀-ACA-3′; Morett and Buck, 1988), are found upstream of the nifB homologue ORF15, the fixA homologue ORF20, ORF21, ORF22, and ORF23. ORF23 is part of cluster V in pXB296, which includes the dctA gene of Rhizobium sp. NGR234 (van Slooten et al., 1992). Surprisingly, the published dctA sequence shows important discrepancies. Therefore, a fragment encompassing this locus was amplified by PCR using NGR234 genomic DNA as template. By sequencing this fragment, the cosmid sequence of the present invention was confirmed.

Example 5

Analysis of the Complete Plasmid pNGR234a

Using the thermostable sequenase/dye terminator cycle sequencing method herein described, 20 overlapping cosmids (including pXB296) of the symbiotic plasmid pNGR234a of Rhizobium sp. NGR234 were sequenced, together with two PCR products and a subcloned DNA fragment derived from cosmid pXB564 that cover two remaining gaps (position 276,448-277,944 and position 480,607-483,991). The map of the sequenced cosmids is shown in FIG. 4. The entire assembled 536 kb sequence of pNGR234a is given in FIG. 3 (deposited in EMBL/GenBank under accession no. U00090).

The analysis of the complete nucleotide sequence revealed few regions of 98-100% identity to sequences already published in public databases. These sequences, which are listed in Table 8, had been derived either from Rhizobium sp. NGR234, from its derivatives or from closely related strains. Therefore, the ORFs and their deduced proteins, 98-100% homologous to nifH, nodA, nodB, nodC, nodD1, nodS, nodU, nolX, nolW, nolB, nolU and “ORF1”, represent already known genes/proteins (Table 8 and References). Some other ORFs and their deduced proteins, nearly identical to public database entries, were either only partially known before the disclosure of the present invention or exhibited significant differences, for instance, dctA, host-inducible gene A, nifD, nifK, nodD2, nolT, nolX, nolV, “ORF140”, “ORF91”, “RFRS9 25 kDa-protein gene” (Table 8 and References).

As a first step, approximately 100 kb of pNGR234a, between positions 417,796 and 517,279, was analyzed using the programs TESTCODE (Fickett, 1982) and CODONPREFERENCE (Gribskov et al., 1984). In this initial ~100 kb of sequence, 76 ORFs were found and ascribed putative functions (= ORFs y4tQ to y4yO, excluding ORFs y4uD, y4uG, y4wG, y4wO, y4wP, y4xF, y4xQ, y4xG and y4yB and excluding ORF-fragments fu1, fu2, fu3, fu4, fv1 and fw1; see Table 3). It should be noted that since the sequence of cosmid pXB296 forms part of this 100 kb region, all of the ORFs identified in Table 2 (except “ORF1”) are reproduced (albeit with minor, but definitive, revisions) in Table 3. Most of the 76 ORFs and their deduced proteins showed homologies to public database entries that could help identify their putative functions. Only ORFs y4vK and y4xA (duplicated nifH) as well as y4yD, y4yE and y4yG (nolW, nolB and nolU) were identical to database entries (98-100% homology). In the case of 7 ORFs and their deduced proteins, no homologous sequences in public databases have been found.

TABLE 8 All ORFs that show 98-100% identity in the nucleotide sequence to ORFs located in pNGR234a and that have already been published in databases

Claimed: + claimed in the patent application; − not claimed in the patent application

ORF                                         Organism                        EMBL/GenBank accession no.   Claimed   Comments
dctA                                        Rhizobium sp. NGR234            S38912                       +         sequencing mistakes in the database entry: the real dctA in pNGR234a is 144 bases longer (see table 4)
host inducible gene A                       Rhizobium fredii USDA 201#      M19019 RFIND                 +         significant difference in pNGR234a (frameshift; see table 4)
nifH                                        Rhizobium sp. ANU 240*          M26961 RHMNIFKDH3            −
nifD (partially)                            Rhizobium sp. ANU 240*          M26961 RHMNIFKDH2            +         only part of nifD is in the public database
nifK (partially)                            Rhizobium sp. ANU 240*          M26961 RHMNIFKDH1            +         only part of nifK is in the public database
nodABC                                      Rhizobium fredii USDA 257#      M73362 RSNOD2                −
nodD1                                       Rhizobium sp. mpik 3030*        Y00059 RSNODD1               −
nodD2                                       Rhizobium japonicum USDA 191#   M18972 RHMNODD2M             +         significantly different function of NodD2 in NGR234 than in USDA 191 (despite 98% identity°)
nodS                                        Rhizobium sp. NGR234            J03686 NGRNODSU              −
nodU (partially)                            Rhizobium sp. NGR234            J03686 NGRNODSU              −
nodU (full)                                 Rhizobium sp.*                  X89965 RSNODUGEN
nolXWBTUV                                   Rhizobium fredii USDA 257#      L12251 RHMNOLBTU             −/+       − nolXWB, nolU; + NolT: 97% identical (amino acid sequence level); + NolX, NolV + ORF4 of pNGR234a show significant differences to USDA257 (see table 4)
ORF1; ORF2 (partially)                      Rhizobium sp. NGR234            X74314 RSORF                 −
ORF140 nodulation gene; ORF91 (partially)   Rhizobium sp. NGR234            X74068 RSPLAS                +         database entry includes sequencing mistakes causing frameshifts
RFRS9 25 kDa protein gene*                  Rhizobium fredii USDA 257#      U18764 RFU18764              +         repetitive element in pNGR234a showing insertions, deletions of nucleotides in comparison to the database entry

*strains representing derivatives of NGR234: Rhizobium sp. ANU 240, Rhizobium sp. mpik 3030, Rhizobium sp.
#strains closely related to NGR234: Rhizobium fredii USDA 257, Rhizobium japonicum USDA 191, Rhizobium fredii USDA 201.
°identity in nucleotide sequence as well as amino acid sequence

As a second step, the remaining 436 kb of pNGR234a were analyzed using the methods noted above. The results of this analysis are discussed in Example 6.

Example 6

Genetic Organization of the Complete Plasmid pNGR234a

In order to confirm and to improve the identification of probable coding regions in pNGR234a, the program GeneMark was used, which is based on matrices developed for organisms related to Rhizobium sp. NGR234 (R. leguminosarum and R. meliloti (Borodovsky et al., 1994)). The use of this program currently represents the most frequently applied method to distinguish coding and non-coding regions in newly sequenced DNA of prokaryotes. Further analysis of the putative ORF products was carried out using methods to detect signal sequences, transmembrane segments and various other domains (PROSITE database search (Bairoch et al., 1995); PSORT program (Nakai et al., 1991)).

In total, 416 ORFs were predicted to encode putative proteins (Freiberg et al., 1997). Additionally, 67 fragments were detected that seemed to be remnants of functional ORFs. Some of these were disrupted by insertion of mobile elements. All identified functional ORFs and fragments of former functional ORFs are listed in Table 3.

Within the initial ~100 kb region (position 417,796 to 517,279) first analyzed in this study, 9 ORFs (y4uD, y4uG, y4wG, y4wO, y4wP, y4xF, y4xQ, y4xG and y4yB) and 6 ORF-fragments (fu1, fu2, fu3, fu4, fv1 and fw1) were predicted in addition to the 76 ORFs (y4tQ to y4yO) listed within Table 3.

According to Table 8, 12 ORFs of the 416 predicted coding regions were identical to public database entries (98% to 100% homology at the amino acid level), namely: y4hI (nodA), y4hH (nodB), y4hG (nodC), y4aL (nodD1), y4nC (nodS), y4nB (nodU), y4sM (ORF1), y4vK (nifH1), y4xA (nifH2), y4yD (nolW), y4yE (nolB), y4yG (nolU). In addition, the database entry of the homologue to y4yC (nolX) has been corrected to 98% identical to y4yC. Furthermore, the sequence of the ORF y4hB (noeE) has been available to the public since October 1996. Except for the 14 ORFs mentioned above, the remaining 402 ORFs are new. 139 of them show no homology to any known ORF/protein. The others exhibit less than 98% amino acid identity to public database entries over their whole length.

Industrial Applicability

The present invention provides a detailed analysis of the symbiotic plasmid pNGR234a of Rhizobium sp. NGR234. The plasmid pNGR234a (including any ORFs encoded therein, or any part of the nucleotide sequence of the plasmid, or any proteins expressible from any of said ORFs or any part of said nucleotide sequence) has industrial applicability which can include its use in, inter alia, the following areas:

(a) the analysis of the structure, organisation or dynamics of other genomes;

(b) the screening, subcloning, or amplification by PCR of nucleotide sequences;

(c) gene trapping;

(d) the identification and classification of organisms and their genetic information;

(e) the identification and characterisation of nucleotide sequences, amino acid sequences or proteins;

(f) the transportation of compounds to and from an organism which is host to at least one of said nucleotide sequences, ORFs or proteins;

(g) the degradation and/or metabolism of organic, inorganic, natural or xenobiotic substances in a host organism;

(h) the modification of the host-range, nitrogen fixation abilities, fitness or competitiveness of organisms;

(i) obtaining a synthetic minimal set of ORFs required for functional Rhizobium-legume symbiosis;

(j) the modification of the host-range of rhizobia;

(k) the augmentation of the fitness or competitiveness of Rhizobium sp. NGR234 in the soil and its nodulation efficiency on host plants;

(l) the introduction of desired phenotype(s) into host plants using said plasmid as a stable shuttle system for foreign DNA encoding said desired phenotype(s); or

(m) the direct transfer of said plasmid into rhizobia or other microorganisms without using other vectors for mobilization.

REFERENCES

Altschul, S. F., W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. 1990. Basic local alignment search tool. J. Mol. Biol. 215: 403-410.

Appelbaum, E. R., D. V. Thompson, K. Idler and N. Chartrain. 1988. Rhizobium japonicum USDA 191 has two nodD genes that differ in primary structure and function. J. Bacteriol. 170: 12-20.

Badenoch-Jones, J., T. A. Holton, C. M. Morrison, K. F. Scott and J. Shine. 1989. Structural and functional analysis of nitrogenase genes from the broad host-range Rhizobium strain ANU240. Gene 77: 141-153.

Bender, G. L., M. Nayudu, K. K. L. Strange and B. G. Rolfe. 1988. The nodD1 gene from Rhizobium strain NGR234 is a key determinant in the extension of host-range to the non-legume Parasponia. Mol. Plant-Microbe Interact. 1: 259.

Bodmer, W. F. 1994. The Human Genome Project. Rev. Invest. Clin. (Suppl.) 3-5.

Broughton, W. J., M. J. Dilworth, and I. K. Passmore. 1972. Base ratio determination using unpurified DNA. Anal. Biochem. 46: 164-172.

Broughton, W. J., N. Heycke, H. Meyer z. A., and C. E. Pankhurst. 1984. Plasmid-linked nif and “nod” genes in fast growing rhizobia that nodulate Glycine max, Psophocarpus tetragonolobus, and Vigna unguiculata. Proc. Natl. Acad. Sci. USA. 81: 3093-3097.

Broughton, W. J., C.-H. Wong, A. Lewin, U. Samrey, H. Myint, H. Meyer z. A., D. N. Dowling, and R. Simon. 1986. Identification of Rhizobium plasmid sequences involved in recognition of Psophocarpus, Vigna, and other legumes. J. Cell Biol. 102: 1173-1182.

Buikema, W. J., W. W. Szeto, P. V. Lemley, W. H. Orme-Johnson, and F. M. Ausubel. 1985. Nitrogen fixation specific regulatory genes of Klebsiella pneumoniae and Rhizobium meliloti share homology with the general nitrogen regulatory gene ntrC of K. pneumoniae. Nucleic Acids Res. 13: 4539-4555.

Cami, B. and P. Kourilsky. 1978. Screening of cloned recombinant DNA in bacteria by in situ colony hybridization. Nucleic Acids Res. 5: 2381-2390.

Craxton, M. 1993. Cosmid sequencing. Methods Mol. Biol. 23: 149-167.

Dear, S. and R. Staden. 1991. A sequence assembly and editing program for efficient management of large projects. Nucleic Acids Res. 19: 3907-3911.

Davis, E. O. and A. W. B. Johnston. 1990. Regulatory functions of the 3 nodD genes of Rhizobium leguminosarum bv. phaseoli. Mol. Microbiol. 4: 933-941.

Dower, W. J., J. F. Miller, and C. W. Ragsdale. 1988. High efficiency transformation of E. coli by high voltage electroporation. Nucleic Acids Res. 16: 6127-6145.

Fellay, R., P. Rochepeau, B. Relić, and W. J. Broughton. 1995. Signals to and emanating from Rhizobium largely control symbiotic specificity. In Pathogenesis and host specificity in plant diseases. Histopathological, biochemical, genetic, and molecular bases (ed. U. S. Singh, R. P. Singh, and K. Kohmoto), Vol. I, pp. 199-220. Pergamon/Elsevier Science Ltd., Oxford, U.K.

Fickett, J. W. 1982. Recognition of protein coding regions in DNA sequences. Nucleic Acids Res. 10: 5303-5318.

Fischer, H.-M. 1994. Genetic regulation of nitrogen fixation in Rhizobia. Microbiol. Rev. 58: 352-386.

Fisher, R. F. and S. R. Long. 1993. Interactions of NodD at the nod box: NodD binds to two distinct sites on the same face of the helix and induces a bend in the DNA. J. Mol. Biol. 233: 336-348.

Fleischmann, R. D., M. D. Adams, O. White, R. A. Clayton, E. F. Kirkness, A. R. Kerlavage, C. J. Bult, J. F. Tomb, B. A. Dougherty, J. M. Merrick, et al. 1995. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 269: 496-512.

Fraser, C. M., J. D. Gocayne, O. White, M. D. Adams, R. A. Clayton, R. D. Fleischmann, C. J. Bult, A. R. Kerlavage, G. Sutton, J. M. Kelley, et al. 1995. The minimal gene complement of Mycoplasma genitalium. Science 270: 397-403.

Freiberg, C., X. Perret, W. J. Broughton and A. Rosenthal. 1996. Sequencing the 500-kb GC-rich symbiotic replicon of Rhizobium sp. NGR234 using dye terminators and a thermostable sequenase: A beginning. Genome Research, in press.

Gribskov, M., J. Devereux, and R. R. Burgess. 1984. The codon preference plot: Graphic analysis of protein coding sequences and prediction of gene expression. Nucleic Acids Res. 12: 539-549.

Hanahan, D. 1983. Studies on transformation of Escherichia coli with plasmids. J. Mol. Biol. 166: 557-580.

Hartl, D. L. and M. J. Palazzolo. 1993. Drosophila as a model organism in genome analysis. In Genome research in molecular medicine and virology (ed. K. W. Adolf), pp. 115-129. Academic Press, Orlando, Fla., U.S.A.

Hiles, I. D., M. P. Gallagher, D. J. Jamieson, and C. F. Higgins. 1987. Molecular characterization of the oligopeptide permease of Salmonella typhimurium. J. Mol. Biol. 195: 125-142.

Iismaa, S. E., P. M. Ealing, K. F. Scott, and J. M. Watson. 1989. Molecular linkage of the nif/fix and nod gene regions in Rhizobium leguminosarum biovar trifolii. Mol. Microbiol. 3: 1753-1764.

Levy, J. 1994. Sequencing the yeast genome: An international achievement. Yeast 10: 1689-1706.

Lewin, A., E. Cervantes, C.-H. Wong and W. J. Broughton. 1990. nodSU, two new nod genes of the broad host range Rhizobium strain NGR234 encode host-specific nodulation of the tropical tree Leucaena leucocephala. Mol. Plant-Microbe Interact. 3: 317-326.

Long, S. R. 1989. Rhizobium-legume nodulation: life together in the underground. Cell 56: 203-214.

Long, S., J. W. Reed, J. Himawan and G. C. Walker. 1988. Genetic analysis of a cluster of genes required for synthesis of the calcofluor-binding exopolysaccharide of Rhizobium meliloti. J. Bacteriol. 170: 4239-4248.

Makino, S.-I., I. Uchida, N. Terakado, C. Sasakawa, and M. Yoshikawa. 1989. Molecular characterization and protein analysis of the cap region, which is essential for encapsulation in Bacillus anthracis. J. Bacteriol. 171: 722-730.

Martinez, E., D. Romero, and R. Palacios. 1990. The Rhizobium genome. Crit. Rev. Plant Sci. 9: 59-93.

Morett, E. and M. Buck. 1988. NifA-dependent in vivo protection demonstrates that the upstream activator sequence of nif promoters is a protein binding site. Proc. Natl. Acad. Sci. USA. 85: 9401-9405.

Morett, E. and M. Buck. 1989. In vivo studies on the interaction of RNA polymerase-σ⁵⁴ with the Klebsiella pneumoniae and Rhizobium meliloti nifH promoters: The role of nifA in the formation of an open promoter complex. J. Mol. Biol. 210: 65-77.

Padmanabhan, S., R.-D. Hirtz, and W. J. Broughton. 1990. Rhizobia in tropical legumes: Cultural characteristics of Bradyrhizobium and Rhizobium sp. Soil Biol. Biochem. 22: 23-28.

Pearson, W. R. and D. J. Lipman. 1988. Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. 85: 2444-2448.

Perego, M., C. F. Higgins, S. R. Pearce, M. P. Gallagher, and J. A. Hoch. 1991. The oligopeptide transport system of Bacillus subtilis plays a role in the initiation of sporulation. Mol. Microbiol. 5: 173-185.

Perret, X., W. J. Broughton, and S. Brenner. 1991. Canonical ordered cosmid library of the symbiotic plasmid of Rhizobium species NGR234. Proc. Natl. Acad. Sci. USA. 88: 1923-1927.

Perret, X., R. Fellay, A. J. Bjourson, J. E. Cooper, S. Brenner, and W. J. Broughton. 1994. Subtraction hybridization and shotgun sequencing: A new approach to identify symbiotic loci. Nucleic Acids Res. 22: 1335-1341.

Platt, T. 1986. Transcription termination and regulation of gene expression. Annu. Rev. Biochem. 55: 339-372.

Radloff, R., W. Bauer, and J. Vinograd. 1967. A dye-buoyant-density method for the detection and isolation of closed circular duplex DNA: The closed circular DNA in HeLa cells. Proc. Natl. Acad. Sci. USA. 57: 1514-1521.

Relić, B., X. Perret, M. T. Estrada-García, J. Kopcinska, W. Golinowski, H. B. Krishnan, S. G. Pueppke and W. J. Broughton. 1994. Nod factors of Rhizobium are a key to the legume door. Mol. Microbiol. 13: 171-178.

Rosenthal, A. and D. S. Charnock-Jones. 1993. Linear amplification sequencing with dye terminators. Methods Mol. Biol. 23: 281-296.

Sambrook, J., E. F. Fritsch, and T. Maniatis. 1989. Molecular cloning: A laboratory manual, 2nd ed., Cold Spring Harbor Laboratory Press, Cold Spring Harbor, N.Y., U.S.A.

Shine, J. and L. Dalgarno. 1974. The 3′-terminal sequence of Escherichia coli 16S ribosomal RNA: Complementary to nonsense triplets and ribosome binding sites. Proc. Natl. Acad. Sci. 71: 1342-1346.

Smith, T. F. and M. S. Waterman. 1981. Identification of common molecular subsequences. J. Mol. Biol. 147: 195-197.

Stanfield, S., L. Ielpi, D. O'Brochta, D. R. Helinski and G. S. Ditta. 1988. The ndvA gene product of Rhizobium meliloti is required for Beta(1-2)glucan production and has homology to the ATP binding export protein HlyB. J. Bacteriol. 170: 3523-3530.

Sulston, J., Z. Du, K. Thomas, R. Wilson, L. Hillier, R. Staden, N. Halloran, P. Green, J. Thierry-Mieg, L. Qiu, et al. 1992. The C. elegans genome sequencing project: A beginning. Nature 356: 37-41.

Tabor, S. and C. C. Richardson. 1995. A single residue in DNA polymerases of the Escherichia coli DNA polymerase I family is critical for distinguishing between deoxy- and dideoxyribonucleotides. Proc. Natl. Acad. Sci. 92: 6339-6343.

van Rhijn, P. and J. Vanderleyden. 1995. The Rhizobium-plant symbiosis. Microbiol. Rev. 59: 124-142.

van Slooten, J. C., T. V. Bhuvaneswari, S. Bardin, and J. Stanley. 1992. Two C₄-dicarboxylate transport systems in Rhizobium sp. NGR234: Rhizobial dicarboxylate transport is essential for nitrogen fixation in tropical legume symbioses. Mol. Plant-Microbe Interact. 5: 179-186.

Yanisch-Perron, C., J. Vieira, and J. Messing. 1985. Improved M13 phage cloning vectors and host strains: Nucleotide sequences of M13mp18 and pUC19 vectors. Gene 33: 103-119.

Bairoch, A., P. Bucher, and K. Hofmann. 1995. The PROSITE database, its status in 1995. Nucleic Acids Res. 24: 189.

Borodovsky, M. Y., K. E. Rudd and E. V. Koonin. 1994. Intrinsic and extrinsic approaches for detecting genes in a bacterial genome. Nucleic Acids Res. 22: 4756.

Broughton, W. J., U. Samrey, and J. Stanley. 1987. Ecological genetics of Rhizobium meliloti: symbiotic plasmid transfer in the Medicago sativa rhizosphere. FEMS Microbiol. Lett. 40: 251.

Fellay, R., X. Perret, V. Viprey, W. J. Broughton, and S. Brenner. 1995a. Organization of host-inducible transcripts on the symbiotic plasmid of Rhizobium sp. NGR234. Mol. Microbiol. 16: 657.

Freiberg, C., R. Fellay, A. Bairoch, W. J. Broughton, A. Rosenthal, and X. Perret. 1997. Molecular basis of symbiosis between Rhizobium and legumes. Nature 387: 394-401.

Gray, J. X., M. A. Djordjevic, and B. G. Rolfe. 1990. Two genes that regulate exopolysaccharide production in Rhizobium sp. strain NGR234: DNA sequences and resultant phenotypes. J. Bacteriol. 172: 195.

Hanin, M., S. Jabbouri, D. Quesada-Vincens, C. Freiberg, X. Perret, J.-C. Prome, W. J. Broughton, and R. Fellay. 1996. Sulphatation of Rhizobium sp. NGR234 Nod factors is dependent on noeE, a new host-specificity gene. Mol. Microbiol., in press.

Krishnan, H. B., C.-I. Kuo, and S. G. Pueppke. 1995. Elaboration of flavonoid-induced proteins by the nitrogen-fixing soybean symbiont Rhizobium fredii is regulated by both nodD1 and nodD2, and is dependent on the cultivar-specificity locus, nolXWBTUV. Microbiology 141: 2245.

Morrison, N. A., C. Y. Hau, M. J. Trinick, J. Shine and B. G. Rolfe. 1983. Heat curing of a sym plasmid in a fast-growing Rhizobium sp. that is able to nodulate legumes and the nonlegume Parasponia sp. J. Bacteriol. 153: 427.

Nakai, K. and M. Kanehisa. 1992. Expert system for predicting protein localization sites in Gram-negative bacteria. Proteins: Structure, Function, and Genetics 11: 95-110.

Piper, K. R., S. Beck von Bodman, and S. K. Farrand. 1993. Conjugation factor of Agrobacterium tumefaciens regulates Ti plasmid transfer by autoinduction. Nature 362: 448.

Sullivan, J. T., H. N. Patrick, W. L. Lowther, D. B. Scott, and C. W. Ronson. 1995. Nodulating strains of Rhizobium loti arise through chromosomal symbiotic gene transfer in the environment. Proc. Natl. Acad. Sci. 92: 8985.

van Slooten, J. C., E. Cervantes, W. J. Broughton, C.-H. Wong, and J. Stanley. 1990. Sequence and analysis of the rpoN sigma factor gene of Rhizobium sp. strain NGR234. J. Bacteriol. 172: 5563.

van Slooten, J. C., T. V. Bhuvaneswari, S. Bardin, and J. Stanley. 1992. Two C4-dicarboxylate transport systems in Rhizobium sp. NGR234: rhizobial dicarboxylate transport is essential for nitrogen fixation in tropical legume symbioses. Mol. Plant-Microbe Interact. 5: 179.

Zhang, L.-H., P. J. Murphy, A. Kerr, and M. E. Tate. 1993. Agrobacterium conjugation and gene regulation by N-acyl-L-homoserine lactones. Nature 362: 446.

SEQUENCE LISTING

The patent contains a lengthy “Sequence Listing” section. A copy of the “Sequence Listing” is available in electronic form from the USPTO web site (http://seqdata.uspto.gov/sequence.html?DocID=06475793B1). An electronic copy of the “Sequence Listing” will also be available from the USPTO upon request and payment of the fee set forth in 37 CFR 1.19(b)(3).

What is claimed is:
 1. An isolated polynucleotide open reading frame (ORF) y4gA to y4gN derived from the polynucleotide sequence of SEQ ID NO: 1 at nucleotide base numbers 142,026 to 143,234 and degenerate variants thereof.
 2. The ORF of claim 1 which is under the control of its natural regulatory elements.
 3. A plasmid which harbours the ORF of claim 1 or any degenerate variant thereof.
 4. The plasmid of claim 3 produced recombinantly.
 5. A method for transforming a microorganism, comprising the step of transforming the microorganism with the plasmid of claim 3.
 6. A method for transforming a plant, comprising the step of transforming the plant with a shuttle vector comprising the plasmid of claim 3.
 7. A method for transforming a plant, comprising the step of transforming the plant with the plasmid of claim 3.
 8. A transgenic plant, transformed with the ORF of claim 1.
 9. A transgenic microorganism, transformed with the ORF of claim 1.